With growing population, urbanization and industrialization, particulate pollution is increasingly becoming a biggest concern.As the particulate pollution increases there will be higher level of unhealthy air. Particulate matters are inhalable by humans and can induce various cardiovascular diseases.As various researches indicate other major pollutants and weather could be a precursor for PM2.5 pollution, this data analysis with given air quality data and weather aims at understanding the 2.5 microns particulate matter relation to other major pollutants and their concentration during different seasons in India for major cities. Predicting PM2.5 can help in taking control actions interms of better solid waster management, industrial wastes, increasing the green cover etc. The study will undertake following objectives
knit_print.data.frame = function(x, ...) {
res = rmarkdown::paged_table(x)
rmarkdown:::knit_print.data.frame(res)
}
registerS3method(
"knit_print", "data.frame", knit_print.data.frame,
envir = asNamespace("knitr")
)
The preliminary analysis involves loading, inspection and cleaning of the Air quality and weather datasets. As part of this activity, primary inspection conducted on the data set to understand the different variables, types and missing information based on visual and numerical analysis. Data cleaning is performed to remove unusable data entries/rows with incomplete cases or missing data.Further various operations performed to create clean, reduced or combined data set for further analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(stringr)
library(lubridate)
library(leaflet)
library(lazyeval)
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## IQR, binom.test, cor, cor.test, cov, fivenum, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(mosaicData)
install.packages("Datasets/statisticalModeling_0.3.0.tar.gz", repos = NULL, type = "source")
## Installing package into '/home/2023MCS120010/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages("Datasets/rpart.plot_3.1.1.tar.gz", repos = NULL, type = "source")
## Installing package into '/home/2023MCS120010/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(statisticalModeling)
##
## Attaching package: 'statisticalModeling'
## The following objects are masked from 'package:ggformula':
##
## gf_abline, gf_bar, gf_boxplot, gf_counts, gf_density,
## gf_density_2d, gf_frame, gf_freqpoly, gf_hex, gf_histogram,
## gf_hline, gf_jitter, gf_line, gf_path, gf_point, gf_text
library(rpart)
library(rpart.plot)
city_day<-read.csv("./Datasets/Air_Quality/city_day.csv")
city_hour<-read.csv("./Datasets/Air_Quality/city_hour.csv")
station_day<-read.csv("./Datasets/Air_Quality/station_day.csv")
station_hour<-read.csv("./Datasets/Air_Quality/station_hour.csv")
stations<-read.csv("./Datasets/Air_Quality/stations.csv")
blr_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Bangalore_1990_2022_BangaloreCity.csv")
chn_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Chennai_1990_2022_Madras.csv")
dlhi_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Delhi_NCR_1990_2022_Safdarjung.csv")
lucknw_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Lucknow_1990_2022.csv")
mum_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Mumbai_1990_2022_Santacruz.csv")
geo_data<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv")
bhu_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/weather_Bhubhneshwar_1990_2022.csv")
rourkela_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/weather_Rourkela_2021_2022.csv")
rajas_wthr<-read.csv("./Datasets/Temperature_And_Precipitation_Cities_IN/Rajasthan_1990_2022_Jodhpur.csv")
#city_day
head(city_day)
summary(city_day)
## City Date PM2.5 PM10
## Length:29531 Length:29531 Min. : 0.04 Min. : 0.01
## Class :character Class :character 1st Qu.: 28.82 1st Qu.: 56.26
## Mode :character Mode :character Median : 48.57 Median : 95.68
## Mean : 67.45 Mean : 118.13
## 3rd Qu.: 80.59 3rd Qu.: 149.75
## Max. :949.99 Max. :1000.00
## NA's :4598 NA's :11140
## NO NO2 NOx NH3
## Min. : 0.02 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 5.63 1st Qu.: 11.75 1st Qu.: 12.82 1st Qu.: 8.58
## Median : 9.89 Median : 21.69 Median : 23.52 Median : 15.85
## Mean : 17.57 Mean : 28.56 Mean : 32.31 Mean : 23.48
## 3rd Qu.: 19.95 3rd Qu.: 37.62 3rd Qu.: 40.13 3rd Qu.: 30.02
## Max. :390.68 Max. :362.21 Max. :467.63 Max. :352.89
## NA's :3582 NA's :3585 NA's :4185 NA's :10328
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.510 1st Qu.: 5.67 1st Qu.: 18.86 1st Qu.: 0.120
## Median : 0.890 Median : 9.16 Median : 30.84 Median : 1.070
## Mean : 2.249 Mean : 14.53 Mean : 34.49 Mean : 3.281
## 3rd Qu.: 1.450 3rd Qu.: 15.22 3rd Qu.: 45.57 3rd Qu.: 3.080
## Max. :175.810 Max. :193.86 Max. :257.73 Max. :455.030
## NA's :2059 NA's :3854 NA's :4022 NA's :5623
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.000 Min. : 0.00 Min. : 13.0 Length:29531
## 1st Qu.: 0.600 1st Qu.: 0.14 1st Qu.: 81.0 Class :character
## Median : 2.970 Median : 0.98 Median : 118.0 Mode :character
## Mean : 8.701 Mean : 3.07 Mean : 166.5
## 3rd Qu.: 9.150 3rd Qu.: 3.35 3rd Qu.: 208.0
## Max. :454.850 Max. :170.37 Max. :2049.0
## NA's :8041 NA's :18109 NA's :4681
#glimpse(city_day)
#str(city_day)
nrow(city_day)
## [1] 29531
ncol(city_day)
## [1] 16
city_day <- replace(city_day, city_day=="", NA)
cd_na_total_per <-data.frame(colname = names(city_day),Total_NAs = colSums(is.na(city_day)), "PerctOfNAs" = colSums((is.na(city_day)/sum(is.na(city_day)) * 100)))
cd_na_total_per
ggplot(cd_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col()
maxna <-max(rowSums(is.na(city_day)))
cd_rowswithmaxna<-subset(city_day, rowSums(is.na(city_day)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
city_day<-subset(city_day, rowSums(is.na(city_day)) != maxna)
#city_day
city_day[58,]
sum(is.na(city_day[58,]))
## [1] 2
cd_rowswithonlyna<-sum(rowSums(is.na(city_day)) == 14)
cd_rowswithonlyna
## [1] 0
head(city_day)
city_day <- replace(city_day, city_day=="", NA)%>%drop_na(AQI_Bucket)
city_day$AQI_Bucket <- as.factor(city_day$AQI_Bucket)
city_day$Date <- as.Date(city_day$Date)
#str(city_day)
summary(city_day)
## City Date PM2.5 PM10
## Length:24850 Min. :2015-01-01 Min. : 0.04 Min. : 0.03
## Class :character 1st Qu.:2017-08-16 1st Qu.: 29.00 1st Qu.: 56.78
## Mode :character Median :2018-11-05 Median : 48.78 Median : 96.18
## Mean :2018-07-24 Mean : 67.48 Mean :118.45
## 3rd Qu.:2019-10-11 3rd Qu.: 80.92 3rd Qu.:150.18
## Max. :2020-07-01 Max. :914.94 Max. :917.08
## NA's :678 NA's :7086
## NO NO2 NOx NH3
## Min. : 0.03 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 5.66 1st Qu.: 11.94 1st Qu.: 13.11 1st Qu.: 8.96
## Median : 9.91 Median : 22.10 Median : 23.68 Median : 16.31
## Mean : 17.62 Mean : 28.98 Mean : 32.29 Mean : 23.85
## 3rd Qu.: 20.03 3rd Qu.: 38.24 3rd Qu.: 40.17 3rd Qu.: 30.36
## Max. :390.68 Max. :362.21 Max. :378.24 Max. :352.89
## NA's :387 NA's :391 NA's :1857 NA's :6536
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.590 1st Qu.: 5.73 1st Qu.: 19.25 1st Qu.: 0.230
## Median : 0.930 Median : 9.22 Median : 31.25 Median : 1.290
## Mean : 2.345 Mean : 14.36 Mean : 34.91 Mean : 3.459
## 3rd Qu.: 1.480 3rd Qu.: 15.14 3rd Qu.: 46.08 3rd Qu.: 3.340
## Max. :175.810 Max. :186.08 Max. :257.73 Max. :455.030
## NA's :445 NA's :605 NA's :807 NA's :3535
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.000 Min. : 0.000 Min. : 13.0 Good :1341
## 1st Qu.: 1.028 1st Qu.: 0.390 1st Qu.: 81.0 Moderate :8829
## Median : 3.575 Median : 1.420 Median : 118.0 Poor :2781
## Mean : 9.526 Mean : 3.589 Mean : 166.5 Satisfactory:8224
## 3rd Qu.: 10.180 3rd Qu.: 4.120 3rd Qu.: 208.0 Severe :1338
## Max. :454.850 Max. :170.370 Max. :2049.0 Very Poor :2337
## NA's :5826 NA's :15372
dropcol<-c("PM10", "Toluene", "Benzene", "Xylene")#, "NH3")
#city_day<-city_day[,dropcol,drop=FALSE]
city_day<-subset(city_day, select=-c(PM10, Toluene, Benzene, Xylene))#, NH3))
head(city_day)
With the initial glimpse on the city air quality data the following characteristics and anomalies were observed. The dataset contains different Indian cities Air pollutant information, air quality index and air quality bucket or index from year 2015 to 2020. The summary indicates there are incomplete or missing variable values through out the dataset and across different columns.The rows with maximum number of NAs has only the date and city name, those rows may not be useful and can be removed. AQI_Bucket has categorical variable and has missing values in various rows and that is not filled with NA and is of type character.The missing information is in different columns but it has useful measurements or values in other rows.This requires further analysis if those rows can be dropped or retained. There are rows with AQI and AQI_Bucket but missing values in certain columns, those rows can be retained considering there could be correlated information but the rows without AQI and AQI_Bucket can be considered for dropping.The missing values or null strings can be filled with NA and AQI_Bucket type can be converted to factor. Dropping columns or variables higher than 10% and the variables not required for analysis. Based on this dropping columns PM10, Benzene, Toluene and Xylene.
The analysis will only focus on major cities ie Bengaluru, Chennai, Delhi and Mumbai. The following dataset processing will focus to extract data only for the interested cities. After filtering the Dataset for specific cities, unique cities are checked on the resultant dataset to ensue if it contains only the required.
major_city_day<-city_day%>%filter(City %in% c("Chennai", "Bengaluru", "Delhi","Mumbai"))
summary(major_city_day)
## City Date PM2.5 NO
## Length:6568 Min. :2015-01-01 Min. : 1.72 Min. : 0.46
## Class :character 1st Qu.:2016-09-05 1st Qu.: 28.24 1st Qu.: 6.87
## Mode :character Median :2018-03-17 Median : 46.09 Median : 11.42
## Mean :2018-01-08 Mean : 65.01 Mean : 20.60
## 3rd Qu.:2019-05-18 3rd Qu.: 75.44 3rd Qu.: 23.89
## Max. :2020-07-01 Max. :685.36 Max. :221.03
## NA's :65 NA's :39
## NO2 NOx NH3 CO
## Min. : 2.49 Min. : 0.00 Min. : 0.02 Min. : 0.000
## 1st Qu.: 15.90 1st Qu.: 15.28 1st Qu.: 19.28 1st Qu.: 0.720
## Median : 26.04 Median : 25.50 Median : 31.11 Median : 0.990
## Mean : 31.47 Mean : 35.21 Mean : 40.05 Mean : 1.526
## 3rd Qu.: 41.03 3rd Qu.: 46.20 3rd Qu.: 46.22 3rd Qu.: 1.430
## Max. :162.50 Max. :254.80 Max. :352.89 Max. :48.070
## NA's :39 NA's :17 NA's :1005 NA's :18
## SO2 O3 AQI AQI_Bucket
## Min. : 0.73 Min. : 1.80 Min. : 20.0 Good : 174
## 1st Qu.: 5.07 1st Qu.: 23.31 1st Qu.: 79.0 Moderate :2238
## Median : 7.89 Median : 34.70 Median :111.0 Poor : 723
## Mean :10.24 Mean : 38.29 Mean :151.7 Satisfactory:2651
## 3rd Qu.:13.51 3rd Qu.: 48.73 3rd Qu.:188.0 Severe : 245
## Max. :71.56 Max. :257.73 Max. :716.0 Very Poor : 537
## NA's :112 NA's :224
unique(city_day$City)
## [1] "Ahmedabad" "Aizawl" "Amaravati"
## [4] "Amritsar" "Bengaluru" "Bhopal"
## [7] "Brajrajnagar" "Chandigarh" "Chennai"
## [10] "Coimbatore" "Delhi" "Ernakulam"
## [13] "Gurugram" "Guwahati" "Hyderabad"
## [16] "Jaipur" "Jorapokhar" "Kochi"
## [19] "Kolkata" "Lucknow" "Mumbai"
## [22] "Patna" "Shillong" "Talcher"
## [25] "Thiruvananthapuram" "Visakhapatnam"
unique(major_city_day$City)
## [1] "Bengaluru" "Chennai" "Delhi" "Mumbai"
Benfordness is checked without using Benford.analyis package instead used only basic functions from packages covered under the course. In contrary to the belief that the leading digit of a numerical data set 1 through 9 has equal probability, where as Benford’s law states that leading digits are distributed in a specific,and nonuniform way.It follows a lograthmic pattern with P(d)=log10(1+1/d), d=1,2..9. This holds true for all naturally or randomly occuring numbers of large dataset. As the airquality data is something which occurs naturally and it random. Benfordness should be holding true for the dataset if it is complete and there are no manipulation.
firstDigit<-function(x){
str_extract(x, "[123456789]")
#as.numeric(x)
#x=x^2
}
BenfordConfirmity<-function(x, y){
x=floor(x) + signif(x, 2)
if(0.000 <= x && x < 0.006)
print(paste(y, "Dataset has Close confirmity with Mean abosolute Deviation", x))
else if(0.006 <= x && x < 0.012)
print(paste(y , "Dataset has Acceptable confirmity with Mean abosolute Deviation", x))
else if(0.012 <= x && x <= 0.015)
print(paste(y, "Dataset has Marginally Acceptable confirmity with Mean abosolute Deviation", x))
else if(x > 0.015)
print(paste(y, "Dataset has no confirmity with Mean abosolute Deviation", x))
}
major_city_day_first_digit <-city_day %>% mutate(across(seq(3:13), firstDigit))%>%subset(select=-c(1,2,12))
PM2.5FD_Counts<-as.vector(table(major_city_day_first_digit$PM2.5))
PM2.5first_digit_actual_vs_expected <- data.frame(
digit = 1:9,
actual.count = PM2.5FD_Counts,
actual.fraction = PM2.5FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
PM2.5first_digit_actual_vs_expected<-PM2.5first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
PM2.5first_digit_actual_vs_expected
BenfordConfirmity(mean(PM2.5first_digit_actual_vs_expected$difference.fraction), "PM2.5")
## [1] "PM2.5 Dataset has no confirmity with Mean abosolute Deviation 0.017"
NOFD_Counts<-as.vector(table(major_city_day_first_digit$NO))
NOfirst_digit_actual_vs_expected <- data.frame(
digit = 1:9,
actual.count = NOFD_Counts,
actual.fraction = NOFD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOfirst_digit_actual_vs_expected<-NOfirst_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOfirst_digit_actual_vs_expected
BenfordConfirmity(mean(NOfirst_digit_actual_vs_expected$difference.fraction), "NO")
## [1] "NO Dataset has no confirmity with Mean abosolute Deviation 0.016"
O3FD_Counts<-as.vector(table(major_city_day_first_digit$O3))
O3first_digit_actual_vs_expected <- data.frame(
digit = 1:9,
actual.count = O3FD_Counts,
actual.fraction = O3FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
O3first_digit_actual_vs_expected<-O3first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
O3first_digit_actual_vs_expected
BenfordConfirmity(mean(O3first_digit_actual_vs_expected$difference.fraction), "O3")
## [1] "O3 Dataset has no confirmity with Mean abosolute Deviation 0.036"
SO2FD_Counts<-as.vector(table(major_city_day_first_digit$SO2))
SO2first_digit_actual_vs_expected <- data.frame(
digit = 1:9,
actual.count = SO2FD_Counts,
actual.fraction = SO2FD_Counts / nrow(major_city_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
SO2first_digit_actual_vs_expected<-SO2first_digit_actual_vs_expected %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
SO2first_digit_actual_vs_expected
BenfordConfirmity(mean(SO2first_digit_actual_vs_expected$difference.fraction), "SO2")
## [1] "SO2 Dataset has no confirmity with Mean abosolute Deviation 0.023"
ggplot(PM2.5first_digit_actual_vs_expected, aes(x=digit, y=benford.fraction))+ geom_col()+ geom_line(aes(y=actual.fraction, color='red')) + scale_x_continuous(name ="digit", breaks = seq(1, 9, 1)) + labs(title = 'Benfordness of PM2.5 in City_day')
The benfordness check was done on selected pollutants like PM2.5, NO, O3 and SO2. More it was done as random sampling check. As per the check all four data doesnt have enough representation to confirms to Benford’s law. “PM2.5 Dataset has no confirmity with Mean abosolute Deviation 0.017” “NO Dataset has no confirmity with Mean abosolute Deviation 0.016” “O3 Dataset has no confirmity with Mean abosolute Deviation 0.036” “SO2 Dataset has no confirmity with Mean abosolute Deviation 0.023”
The possible reasons could be lot of missing data, as the source is unknown potentially could have been manipulated or errors in the recording. This might impact the final model which may to require to revisit the data.
city_day_monthwise<-major_city_day%>%group_by(City, year=year(Date), month = month(Date))%>%summarise(across(c(seq(3:13)), mean, na.rm=TRUE))
## Warning: There were 225 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(3:13)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `year = 2015`, `month = 3`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 224 remaining warnings.
## `summarise()` has grouped output by 'City', 'year'. You can override using the
## `.groups` argument.
city_day_monthwise
## # A tibble: 224 Ă— 14
## # Groups: City, year [21]
## City year month Date PM2.5 NO NO2 NOx NH3 CO SO2 O3
## <chr> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Benga… 2015 3 2015-03-26 48.1 3.34 22.3 15.8 28.3 5.53 3.22 60.1
## 2 Benga… 2015 4 2015-04-15 28.7 3.89 8.88 10.6 23.6 3.83 8.31 63.8
## 3 Benga… 2015 5 2015-05-16 33.5 5.84 22.3 13.7 11.9 5.49 5.27 29.4
## 4 Benga… 2015 6 2015-06-15 18.3 2.72 13.5 9.17 11.5 3.21 6.61 21.0
## 5 Benga… 2015 7 2015-07-16 21.0 4.12 10.1 7.86 12.3 9.55 4.04 7.53
## 6 Benga… 2015 8 2015-08-16 21.1 4.24 13.9 9.79 24.7 14.1 3.02 6.19
## 7 Benga… 2015 9 2015-09-15 20.2 4.94 15.8 12.2 33.2 5.97 2.52 15.7
## 8 Benga… 2015 10 2015-10-16 31.4 5.29 28.1 20.8 31.5 2.28 3.13 17.9
## 9 Benga… 2015 11 2015-11-15 30.8 5.48 17.6 17.4 27.5 1.61 2.80 19.9
## 10 Benga… 2015 12 2015-12-16 46.0 16.2 43.5 32.6 30.3 1.64 5.65 21.5
## # ℹ 214 more rows
## # ℹ 2 more variables: AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_day_monthwise, aes(x=month, y=AQI, color=factor(year))) + geom_line() + facet_wrap(~City) + scale_x_continuous(name ="month", breaks = seq(1, 12, 1))
AQI over different months for major cities are highly varying. Delhi has very high AQI over 200 different months and even goes above 400 compared to other cities. The AQI is high during start of the year and really high during end of the year. Across different cities the common observation over the give years the AQI has improved for the respective city AQI benchmarks and AQI vary across different months. This clearly indicates that seasonal changes and weather has impact on the AQI. Mumbai appears to have missing data.
city_day_yearwise<-major_city_day%>%group_by(City, year=year(Date))%>%summarise(across(c(seq(3:13)), mean, na.rm=TRUE))%>%filter(City %in% c("Ahmedabad", "Chennai", "Bengaluru", "Delhi","Mumbai"))
## Warning: There were 21 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(3:13)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `year = 2015`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 20 remaining warnings.
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
#city_day_yearwise
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=PM2.5, color="PM2.5", na.rm=TRUE)) + facet_wrap(~City, nrow=5)
## Warning in geom_line(aes(y = PM2.5, color = "PM2.5", na.rm = TRUE)): Ignoring
## unknown aesthetics: na.rm
#sggplot(city_day_yearwise, aes(x=year)) + geom_bar(aes(y=NH3, color=City, na.rm=TRUE)) + facet_wrap(~City, nrow=5)
Similar to AQI Delhi has high PM2.5 concentration and next is chennai. Mumbai has low PM2.5 low PM2.5 concentration and no data available for 2015-2017. PM2.5 data for Mumbai may be incorrect given urbanization and industrialization in Mumbai. Bangalore has very low PM2.5 concentration.
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=NO, color="NO")) + geom_line(aes(y=NO2, color="NO2")) + geom_line(aes(y=NOx, color="NOx"))+ facet_wrap(~City, nrow=5)
Compared to Chennai, Bangalore has high NO, NO2 and NOx concentration.Mumbai also higher concentration than Bangalore and chennai. Delhi similar to AQI and PM2.5, has higher concentration on NO, NO2 and NOx
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=SO2, color="SO2")) + facet_wrap(~City, nrow=5)
Bangalore has very low SO2 concentration. SO2 concentration is high in both Mumbai and Delhi.
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=CO, color="CO")) + facet_wrap(~City, nrow=5)
CO concentraion has reduced over the years for all the cities.
ggplot(city_day_yearwise, aes(x=year)) + geom_line(aes(y=O3, color="O3")) + facet_wrap(~City, nrow=5)
O3 concentration is higher in Delhi. Chennai and Bangalore had more or similar concentration. Mumbai has lowest O3 concentration.
#city_hour
head(city_hour)
glimpse(city_hour)
## Rows: 707,875
## Columns: 16
## $ City <chr> "Ahmedabad", "Ahmedabad", "Ahmedabad", "Ahmedabad", "Ahmeda…
## $ Datetime <chr> "2015-01-01 01:00:00", "2015-01-01 02:00:00", "2015-01-01 0…
## $ PM2.5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PM10 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NO <dbl> 1.00, 0.02, 0.08, 0.30, 0.12, 0.33, 0.45, 1.03, 1.47, 2.05,…
## $ NO2 <dbl> 40.01, 27.75, 19.32, 16.45, 14.90, 15.95, 15.94, 16.66, 16.…
## $ NOx <dbl> 36.37, 19.73, 11.08, 9.20, 7.85, 10.82, 12.47, 16.48, 18.02…
## $ NH3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CO <dbl> 1.00, 0.02, 0.08, 0.30, 0.12, 0.33, 0.45, 1.03, 1.47, 2.05,…
## $ SO2 <dbl> 122.07, 85.90, 52.83, 39.53, 32.63, 29.87, 27.41, 20.92, 16…
## $ O3 <dbl> NA, NA, NA, 153.58, NA, 64.25, 191.96, 177.21, 122.08, NA, …
## $ Benzene <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Toluene <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ Xylene <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AQI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ AQI_Bucket <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
city_hour <- replace(city_hour, city_hour=="", NA)
ch_na_total_per <-data.frame(colname = names(city_hour),Total_NAs = colSums(is.na(city_hour)), "PerctOfNAs" = colSums((is.na(city_hour)/sum(is.na(city_hour)) * 100)))
ch_na_total_per
ggplot(ch_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col()
maxna <-max(rowSums(is.na(city_hour)))
cd_rowswithmaxna<-subset(city_hour, rowSums(is.na(city_hour)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
city_hour<-subset(city_hour, rowSums(is.na(city_hour)) != maxna)
#city_hour
head(city_hour)
city_hour <- replace(city_hour, city_hour=="", NA)%>% drop_na(AQI_Bucket)
city_hour$AQI_Bucket <- as.factor(city_hour$AQI_Bucket)
#str(city_hour)
summary(city_hour)
## City Datetime PM2.5 PM10
## Length:578795 Length:578795 Min. : 0.01 Min. : 0.01
## Class :character Class :character 1st Qu.: 26.27 1st Qu.: 52.78
## Mode :character Mode :character Median : 46.50 Median : 92.00
## Mean : 67.51 Mean : 119.60
## 3rd Qu.: 79.48 3rd Qu.: 148.14
## Max. : 999.99 Max. :1000.00
## NA's :31035 NA's :178863
## NO NO2 NOx NH3
## Min. : 0.01 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 3.87 1st Qu.: 10.96 1st Qu.: 10.97 1st Qu.: 8.50
## Median : 7.97 Median : 20.68 Median : 20.99 Median : 15.87
## Mean : 17.50 Mean : 29.27 Mean : 32.34 Mean : 23.98
## 3rd Qu.: 16.14 3rd Qu.: 36.92 3rd Qu.: 37.46 3rd Qu.: 29.65
## Max. :498.97 Max. :499.51 Max. :498.61 Max. :499.97
## NA's :21481 NA's :22267 NA's :51906 NA's :163994
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.00
## 1st Qu.: 0.490 1st Qu.: 4.95 1st Qu.: 13.79 1st Qu.: 0.15
## Median : 0.840 Median : 8.46 Median : 26.64 Median : 1.09
## Mean : 2.277 Mean : 14.01 Mean : 35.25 Mean : 3.33
## 3rd Qu.: 1.400 3rd Qu.: 14.77 3rd Qu.: 48.22 3rd Qu.: 3.03
## Max. :498.570 Max. :199.96 Max. :497.62 Max. :498.07
## NA's :29146 NA's :33058 NA's :33372 NA's :96365
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.00 Min. : 0.0 Min. : 8.0 Good : 38611
## 1st Qu.: 0.75 1st Qu.: 0.3 1st Qu.: 79.0 Moderate :198991
## Median : 3.14 Median : 1.2 Median : 116.0 Poor : 66654
## Mean : 9.50 Mean : 3.7 Mean : 166.4 Satisfactory:189434
## 3rd Qu.: 9.56 3rd Qu.: 3.8 3rd Qu.: 208.0 Severe : 27650
## Max. :499.40 Max. :500.0 Max. :3133.0 Very Poor : 57455
## NA's :149001 NA's :371700
head(city_hour)
tail(city_hour)
In city hour data set, Xylene, NH3, PM10 has greater 10% of NAs. Other pollutants has nearly 5% of NAs. Remove rows with all NAs and rows with AQI and AQI bucket with NAs. Further determining and assigning seasons to every entry in the dataset based on date.
season<-function(date){
winter <- as.Date("2012-01-15", format = "%Y-%m-%d") # Winter
spring <- as.Date("2012-02-15", format = "%Y-%m-%d") # Spring
summer <- as.Date("2012-03-15", format = "%Y-%m-%d") # Summer
monsoon <- as.Date("2012-06-15", format = "%Y-%m-%d") # Monsoon
autumn <- as.Date("2012-10-15", format = "%Y-%m-%d") # autumn
prewinter <- as.Date("2012-12-15", format = "%Y-%m-%d") # prewinter
# Convert dates from any year to 2012 dates
d <- as.Date(strftime(date, format="2012-%m-%d"))
ifelse (d >= winter & d < spring, "Winter",
ifelse (d >= spring & d < summer, "Spring",
ifelse (d >= summer & d < monsoon, "Summer",
ifelse (d >= monsoon & d < autumn, "Monsoon",
ifelse (d >= autumn & d < prewinter, "Autumn", "Prewinter")))))
}
#season(as.Date("2015-11-20"))
#head(city_hour)
city_hour_sepdt<-city_hour %>% separate(Datetime, c( "Date", "Time"), " ")%>% filter(City %in% c("Chennai", "Bengaluru", "Delhi","Mumbai")) %>%mutate(Season=season(Date), Year = lubridate::year(Date))#%>%select(Year, Season, Time, PM2.5)
#head(city_hour_sepdt)
Change in AQI or pollutant concentration analysis is done for Chennai
city_hour_sepdt_byseason_ch<-city_hour_sepdt %>% filter(Year == 2020 & City == "Chennai") %>% group_by(City, Year,Season, Time) %>%summarise(across(c(seq(4:18)), mean, na.rm=TRUE)) #%>% summarise(PM2.5_Mean= mean(PM2.5))
## Warning: There were 240 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(4:18)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Chennai"`, `Year = 2020`, `Season = "Monsoon"`, `Time =
## "00:00:00"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 239 remaining warnings.
## `summarise()` has grouped output by 'City', 'Year', 'Season'. You can override
## using the `.groups` argument.
city_hour_sepdt_byseason_ch
## # A tibble: 120 Ă— 19
## # Groups: City, Year, Season [5]
## City Year Season Time Date PM2.5 PM10 NO NO2 NOx NH3 CO
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chennai 2020 Monsoon 00:00:… NA 28.0 40.5 7.05 8.94 15.5 34.4 1.02
## 2 Chennai 2020 Monsoon 01:00:… NA 23.2 41.8 8.36 8.97 16.9 34.4 1.03
## 3 Chennai 2020 Monsoon 02:00:… NA 22.4 39.2 8.10 9.43 17.0 34.6 1.07
## 4 Chennai 2020 Monsoon 03:00:… NA 22.8 40.2 7.83 10.0 17.2 34.5 1.09
## 5 Chennai 2020 Monsoon 04:00:… NA 25.8 37.2 8.00 10.9 18.3 34.6 1.10
## 6 Chennai 2020 Monsoon 05:00:… NA 34.8 31.3 8.20 11.7 19.5 35.5 1.09
## 7 Chennai 2020 Monsoon 06:00:… NA 32.8 30.4 8.50 12.4 20.4 34.4 1.10
## 8 Chennai 2020 Monsoon 07:00:… NA 27.1 32.6 9.18 13.4 22.2 34.2 1.15
## 9 Chennai 2020 Monsoon 08:00:… NA 34.3 34.4 9.33 14.2 23.0 34.3 1.19
## 10 Chennai 2020 Monsoon 09:00:… NA 34.9 31.4 8.84 12.4 20.8 34.5 1.18
## # ℹ 110 more rows
## # ℹ 7 more variables: SO2 <dbl>, O3 <dbl>, Benzene <dbl>, Toluene <dbl>,
## # Xylene <dbl>, AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=PM2.5, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=O3, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=NO, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90))# + facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=CO, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=SO2, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ch, aes(x=Time, y=AQI, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
For Chennai 1. PM2.5 peaks from morning till afternoon 1PM and the concentration high during prewinter and winter season. During summer concentration is low. 2. O3 is very high during monsoon and high during summer. O3 peaks during the peak hours 9 am to 6 pm. 3. CO is very high during monsoon and high during winter. CO peaks first half of the day and drops during second of the day. 4. SO2 varies through out the day, high during winter and low during summer. 5. AQI is high during monsoon and low during summer and peaks during afternoon till midnight 6. NO is high during early morning and low during afternoons.
city_hour_sepdt_byseason_ah<-city_hour_sepdt %>% filter(Year == 2020 & City == "Bengaluru") %>% group_by(City, Year,Season, Time) %>%summarise(across(c(seq(4:18)), mean, na.rm=TRUE)) #%>% summarise(PM2.5_Mean= mean(PM2.5))
## Warning: There were 240 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(seq(4:18)), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `Year = 2020`, `Season = "Monsoon"`, `Time
## = "00:00:00"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 239 remaining warnings.
## `summarise()` has grouped output by 'City', 'Year', 'Season'. You can override
## using the `.groups` argument.
city_hour_sepdt_byseason_ah
## # A tibble: 120 Ă— 19
## # Groups: City, Year, Season [5]
## City Year Season Time Date PM2.5 PM10 NO NO2 NOx NH3 CO
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bengaluru 2020 Monsoon 00:0… NA 19.5 38.3 3.93 12.2 13.6 6.84 0.574
## 2 Bengaluru 2020 Monsoon 01:0… NA 16.1 32.9 3.84 11.0 12.8 6.65 0.482
## 3 Bengaluru 2020 Monsoon 02:0… NA 15.2 31.3 3.78 10.5 12.6 6.52 0.495
## 4 Bengaluru 2020 Monsoon 03:0… NA 15.0 31.0 3.71 10.2 12.4 6.43 0.471
## 5 Bengaluru 2020 Monsoon 04:0… NA 15.2 32.4 3.72 10.1 12.2 6.58 0.492
## 6 Bengaluru 2020 Monsoon 05:0… NA 13.9 33.9 3.81 10.0 12.3 6.67 0.611
## 7 Bengaluru 2020 Monsoon 06:0… NA 14.0 37.9 4.03 10.7 12.9 7.05 0.527
## 8 Bengaluru 2020 Monsoon 07:0… NA 14.8 33.8 4.65 12.1 14.4 6.77 0.554
## 9 Bengaluru 2020 Monsoon 08:0… NA 17.5 38.5 5.38 13.1 15.6 6.72 0.576
## 10 Bengaluru 2020 Monsoon 09:0… NA 20.5 39.4 5.76 13.7 16.5 6.81 0.648
## # ℹ 110 more rows
## # ℹ 7 more variables: SO2 <dbl>, O3 <dbl>, Benzene <dbl>, Toluene <dbl>,
## # Xylene <dbl>, AQI <dbl>, AQI_Bucket <dbl>
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=PM2.5, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=O3, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=NO, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90))# + facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=CO, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=SO2, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
ggplot(city_hour_sepdt_byseason_ah, aes(x=Time, y=AQI, color=factor(Season))) + geom_point() + theme(axis.text.x = element_text(angle = 90)) #+ facet_wrap(~City, nrow=5)
For Bengaluru 1. PM2.5 peaks from morning till afternoon 12PM and the concentration high during spring and winter season. During Monsoon concentraion is low. 2. O3 is very high during Spring and high during winter and prewinter. O3 peaks during the peak during afternoon and midnight. 3. CO is very high during monsoon and high during winter. CO peaks first half of the day and drops during second of the day. 4. AQI drops during early morning and peaks in the afternoon, high during spring and low during monsoon. 5. SO2 is high during winter and low during summer and peaks during afternoon till midnight 6. NO high during winter and prewinter, peaks during early morning and late night
##Creating combined Data set with Station day and Weather for major cities. The below code extract the station Ids for the major cities that are being analyzed. The City name, Season, Year, Latitude, longitude, elevation data are also added to the combined data set for modeling.
head(stations)
summary(stations)
## StationId StationName City State
## Length:230 Length:230 Length:230 Length:230
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Status
## Length:230
## Class :character
## Mode :character
stations$City
## [1] "Amaravati" "Rajamahendravaram" "Tirupati"
## [4] "Vijayawada" "Visakhapatnam" "Guwahati"
## [7] "Gaya" "Gaya" "Hajipur"
## [10] "Muzaffarpur" "Patna" "Patna"
## [13] "Patna" "Patna" "Patna"
## [16] "Patna" "Chandigarh" "Delhi"
## [19] "Delhi" "Delhi" "Delhi"
## [22] "Delhi" "Delhi" "Delhi"
## [25] "Delhi" "Delhi" "Delhi"
## [28] "Delhi" "Delhi" "Delhi"
## [31] "Delhi" "Delhi" "Delhi"
## [34] "Delhi" "Delhi" "Delhi"
## [37] "Delhi" "Delhi" "Delhi"
## [40] "Delhi" "Delhi" "Delhi"
## [43] "Delhi" "Delhi" "Delhi"
## [46] "Delhi" "Delhi" "Delhi"
## [49] "Delhi" "Delhi" "Delhi"
## [52] "Delhi" "Delhi" "Delhi"
## [55] "Delhi" "Ahmedabad" "Ankleshwar"
## [58] "Gandhinagar" "Nandesari" "Vapi"
## [61] "Vatva" "Ambala" "Bahadurgarh"
## [64] "Ballabgarh" "Bhiwani" "Dharuhera"
## [67] "Faridabad" "Faridabad" "Faridabad"
## [70] "Faridabad" "Fatehabad" "Gurugram"
## [73] "Gurugram" "Gurugram" "Gurugram"
## [76] "Hisar" "Jind" "Kaithal"
## [79] "Karnal" "Kurukshetra" "Mandikhera"
## [82] "Manesar" "Narnaul" "Palwal"
## [85] "Panchkula" "Panipat" "Rohtak"
## [88] "Sirsa" "Sonipat" "Yamuna Nagar"
## [91] "Jorapokhar" "Bagalkot" "Bengaluru"
## [94] "Bengaluru" "Bengaluru" "Bengaluru"
## [97] "Bengaluru" "Bengaluru" "Bengaluru"
## [100] "Bengaluru" "Bengaluru" "Bengaluru"
## [103] "Chamarajanagar" "Chikkaballapur" "Chikkamagaluru"
## [106] "Hubballi" "Kalaburagi" "Mysuru"
## [109] "Ramanagara" "Vijayapura" "Yadgir"
## [112] "Eloor" "Ernakulam" "Kannur"
## [115] "Kochi" "Kollam" "Kozhikode"
## [118] "Thiruvananthapuram" "Thiruvananthapuram" "Bhopal"
## [121] "Damoh" "Dewas" "Gwalior"
## [124] "Gwalior" "Indore" "Jabalpur"
## [127] "Katni" "Maihar" "Mandideep"
## [130] "Pithampur" "Ratlam" "Sagar"
## [133] "Satna" "Singrauli" "Ujjain"
## [136] "Aurangabad" "Chandrapur" "Chandrapur"
## [139] "Kalyan" "Mumbai" "Mumbai"
## [142] "Mumbai" "Mumbai" "Mumbai"
## [145] "Mumbai" "Mumbai" "Mumbai"
## [148] "Mumbai" "Mumbai" "Nagpur"
## [151] "Nashik" "Navi Mumbai" "Navi Mumbai"
## [154] "Navi Mumbai" "Pune" "Solapur"
## [157] "Thane" "Shillong" "Aizawl"
## [160] "Brajrajnagar" "Talcher" "Amritsar"
## [163] "Bathinda" "Jalandhar" "Khanna"
## [166] "Ludhiana" "Gobindgarh" "Patiala"
## [169] "Rupnagar" "Alwar" "Ajmer"
## [172] "Bhiwandi" "Jaipur" "Jaipur"
## [175] "Jaipur" "Jodhpur" "Kota"
## [178] "Pali" "Udaipur" "Chennai"
## [181] "Chennai" "Chennai" "Chennai"
## [184] "Coimbatore" "Hyderabad" "Hyderabad"
## [187] "Hyderabad" "Hyderabad" "Hyderabad"
## [190] "Hyderabad" "Agra" "Baghpat"
## [193] "Bulandshahr" "Ghaziabad" "Ghaziabad"
## [196] "Ghaziabad" "Ghaziabad" "Greater Noida"
## [199] "Greater Noida" "Hapur" "Kanpur"
## [202] "Lucknow" "Lucknow" "Lucknow"
## [205] "Lucknow" "Lucknow" "Meerut"
## [208] "Meerut" "Meerut" "Moradabad"
## [211] "Muzzaffarnagar" "Noida" "Noida"
## [214] "Noida" "Noida" "Varanasi"
## [217] "Asansol" "Durgapur" "Haldia"
## [220] "Howrah" "Howrah" "Howrah"
## [223] "Kolkata" "Kolkata" "Kolkata"
## [226] "Kolkata" "Kolkata" "Kolkata"
## [229] "Kolkata" "Siliguri"
aq_delhi_stations<-stations %>%filter(City=="Delhi")
aq_delhi_stations
aq_mumbai_stations<-stations %>%filter(City=="Mumbai")
aq_mumbai_stations
aq_chennai_stations<-stations %>%filter(City=="Chennai")
aq_chennai_stations
aq_bengaluru_stations<-stations %>%filter(City=="Bengaluru")
aq_bengaluru_stations
head(station_hour)
summary(station_hour)
## StationId Datetime PM2.5 PM10
## Length:2589083 Length:2589083 Min. : 0.0 Min. : 0.0
## Class :character Class :character 1st Qu.: 28.2 1st Qu.: 64.0
## Mode :character Mode :character Median : 52.6 Median : 116.2
## Mean : 80.9 Mean : 158.5
## 3rd Qu.: 97.7 3rd Qu.: 204.0
## Max. :1000.0 Max. :1000.0
## NA's :647689 NA's :1119252
## NO NO2 NOx NH3
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.0 1st Qu.: 13.1 1st Qu.: 11.3 1st Qu.: 11.2
## Median : 7.2 Median : 24.8 Median : 22.9 Median : 22.4
## Mean : 22.8 Mean : 35.2 Mean : 40.6 Mean : 28.7
## 3rd Qu.: 18.6 3rd Qu.: 45.5 3rd Qu.: 45.7 3rd Qu.: 37.8
## Max. :500.0 Max. :500.0 Max. :500.0 Max. :500.0
## NA's :553711 NA's :528973 NA's :490808 NA's :1236618
## CO SO2 O3 Benzene
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.4 1st Qu.: 4.2 1st Qu.: 11.0 1st Qu.: 0.1
## Median : 0.8 Median : 8.2 Median : 24.8 Median : 1.0
## Mean : 1.5 Mean : 12.1 Mean : 38.1 Mean : 3.3
## 3rd Qu.: 1.4 3rd Qu.: 14.5 3rd Qu.: 49.5 3rd Qu.: 3.2
## Max. :498.6 Max. :200.0 Max. :997.0 Max. :498.1
## NA's :499302 NA's :742737 NA's :725973 NA's :861579
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.0 Min. : 0.0 Min. : 5.0 Length:2589083
## 1st Qu.: 0.3 1st Qu.: 0.0 1st Qu.: 84.0 Class :character
## Median : 3.4 Median : 0.2 Median : 131.0 Mode :character
## Mean : 14.9 Mean : 2.4 Mean : 180.2
## 3rd Qu.: 15.1 3rd Qu.: 1.8 3rd Qu.: 259.0
## Max. :500.0 Max. :500.0 Max. :3133.0
## NA's :1042366 NA's :2075104 NA's :570190
maxna <-max(rowSums(is.na(station_hour)))
cd_rowswithmaxna<-subset(station_hour, rowSums(is.na(station_hour)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
station_hour<-subset(station_hour, rowSums(is.na(station_hour)) != maxna)
#station_hour
head(station_hour)
station_hour <- replace(station_hour, station_hour=="", NA)%>% drop_na(AQI_Bucket)
station_hour$AQI_Bucket <- as.factor(station_hour$AQI_Bucket)
#str(station_hour)
summary(station_hour)
## StationId Datetime PM2.5 PM10
## Length:2018893 Length:2018893 Min. : 0.01 Min. : 0.0
## Class :character Class :character 1st Qu.: 28.25 1st Qu.: 64.0
## Mode :character Mode :character Median : 52.72 Median : 116.5
## Mean : 80.76 Mean : 158.8
## 3rd Qu.: 97.64 3rd Qu.: 204.5
## Max. :1000.00 Max. :1000.0
## NA's :138787 NA's :597086
## NO NO2 NOx NH3
## Min. : 0.01 Min. : 0.01 Min. : 0.00 Min. : 0.0
## 1st Qu.: 3.00 1st Qu.: 13.15 1st Qu.: 12.66 1st Qu.: 11.4
## Median : 7.05 Median : 24.80 Median : 24.13 Median : 22.5
## Mean : 22.89 Mean : 35.13 Mean : 42.52 Mean : 28.6
## 3rd Qu.: 18.20 3rd Qu.: 45.41 3rd Qu.: 47.33 3rd Qu.: 37.8
## Max. :500.00 Max. :499.99 Max. :500.00 Max. :500.0
## NA's :124498 NA's :99059 NA's :157440 NA's :734261
## CO SO2 O3 Benzene
## Min. : 0.00 Min. : 0.01 Min. : 0.01 Min. : 0.0
## 1st Qu.: 0.48 1st Qu.: 4.30 1st Qu.: 10.92 1st Qu.: 0.1
## Median : 0.84 Median : 8.35 Median : 24.73 Median : 1.1
## Mean : 1.53 Mean : 12.15 Mean : 38.27 Mean : 3.5
## 3rd Qu.: 1.42 3rd Qu.: 14.62 3rd Qu.: 49.95 3rd Qu.: 3.4
## Max. :498.57 Max. :199.96 Max. :997.00 Max. :498.1
## NA's :170834 NA's :294848 NA's :274286 NA's :510708
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.0 Min. : 0.0 Min. : 5.0 Good :152113
## 1st Qu.: 0.8 1st Qu.: 0.0 1st Qu.: 84.0 Moderate :675008
## Median : 4.2 Median : 0.4 Median : 131.0 Poor :239990
## Mean : 16.1 Mean : 2.8 Mean : 180.2 Satisfactory:530164
## 3rd Qu.: 16.8 3rd Qu.: 2.1 3rd Qu.: 259.0 Severe :120468
## Max. :500.0 Max. :500.0 Max. :3133.0 Very Poor :301150
## NA's :670687 NA's :1585481
head(station_hour)
tail(station_hour)
Extracting the lat, lon and elevation info from geo data. Banglore info was incorrect and it is correct with right values for lat, lon and elevation.
blr_geo<-geo_data %>%filter(Location_Name=="Bangalore")
blr_geo<-blr_geo%>%mutate(longitude=77.59, Latitude=12.97, Elevation=920)
chn_geo<-geo_data%>%filter(Location_Name=="Chennai")
dlhi_geo<-geo_data%>%filter(Location_Name=="Delhi")
mum_geo<-geo_data%>%filter(Location_Name=="Mumbai")
bhu_geo<-geo_data%>%filter(Location_Name=="Bhuvaneswar")
rourkela_geo<-geo_data%>%filter(Location_Name=="Rourkela")
rajas_geo<-geo_data%>%filter(Location_Name=="Rajastan")
lucknw_geo<-geo_data%>%filter(Location_Name=="Lucknow")
nrow(station_day)
## [1] 108035
ncol(station_day)
## [1] 16
head(station_day)
summary(station_day)
## StationId Date PM2.5 PM10
## Length:108035 Length:108035 Min. : 0.02 Min. : 0.01
## Class :character Class :character 1st Qu.: 31.88 1st Qu.: 70.15
## Mode :character Mode :character Median : 55.95 Median : 122.09
## Mean : 80.27 Mean : 157.97
## 3rd Qu.: 99.92 3rd Qu.: 208.67
## Max. :1000.00 Max. :1000.00
## NA's :21625 NA's :42706
## NO NO2 NOx NH3
## Min. : 0.01 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 4.84 1st Qu.: 15.09 1st Qu.: 13.97 1st Qu.: 11.90
## Median : 10.29 Median : 27.21 Median : 26.66 Median : 23.59
## Mean : 23.12 Mean : 35.24 Mean : 41.20 Mean : 28.73
## 3rd Qu.: 24.98 3rd Qu.: 46.93 3rd Qu.: 50.50 3rd Qu.: 38.14
## Max. :470.00 Max. :448.05 Max. :467.63 Max. :418.90
## NA's :17106 NA's :16547 NA's :15500 NA's :48105
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.530 1st Qu.: 5.04 1st Qu.: 18.89 1st Qu.: 0.160
## Median : 0.910 Median : 8.95 Median : 30.84 Median : 1.210
## Mean : 1.606 Mean : 12.26 Mean : 38.13 Mean : 3.358
## 3rd Qu.: 1.450 3rd Qu.: 14.92 3rd Qu.: 47.14 3rd Qu.: 3.610
## Max. :175.810 Max. :195.65 Max. :963.00 Max. :455.030
## NA's :12998 NA's :25204 NA's :25568 NA's :31455
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.00 Min. : 0.00 Min. : 8.0 Length:108035
## 1st Qu.: 0.69 1st Qu.: 0.00 1st Qu.: 86.0 Class :character
## Median : 4.33 Median : 0.40 Median : 132.0 Mode :character
## Mean : 15.35 Mean : 2.42 Mean : 179.7
## 3rd Qu.: 17.51 3rd Qu.: 2.11 3rd Qu.: 254.0
## Max. :454.85 Max. :170.37 Max. :2049.0
## NA's :38702 NA's :85137 NA's :21010
maxna <-max(rowSums(is.na(station_day)))
cd_rowswithmaxna<-subset(station_day, rowSums(is.na(station_day)) == maxna)
#cd_rowswithmorena
head(cd_rowswithmaxna)
station_day<-subset(station_day, rowSums(is.na(station_day)) != maxna)
#station_day
head(station_day)
station_day <- replace(station_day, station_day=="", NA)%>% drop_na(AQI_Bucket)
station_day$AQI_Bucket <- as.factor(station_day$AQI_Bucket)
#str(station_day)
summary(station_day)
## StationId Date PM2.5 PM10
## Length:87025 Length:87025 Min. : 0.04 Min. : 0.03
## Class :character Class :character 1st Qu.: 32.02 1st Qu.: 70.52
## Mode :character Mode :character Median : 56.17 Median :122.43
## Mean : 80.39 Mean :158.56
## 3rd Qu.: 100.17 3rd Qu.:209.47
## Max. :1000.00 Max. :976.77
## NA's :3488 NA's :23961
## NO NO2 NOx NH3
## Min. : 0.02 Min. : 0.01 Min. : 0.00 Min. : 0.01
## 1st Qu.: 4.79 1st Qu.: 15.18 1st Qu.: 15.51 1st Qu.: 12.05
## Median : 10.22 Median : 27.24 Median : 28.18 Median : 23.76
## Mean : 23.24 Mean : 35.12 Mean : 43.25 Mean : 28.65
## 3rd Qu.: 24.92 3rd Qu.: 46.83 3rd Qu.: 52.59 3rd Qu.: 38.15
## Max. :437.85 Max. :448.05 Max. :434.90 Max. :365.68
## NA's :2229 NA's :1566 NA's :4555 NA's :29832
## CO SO2 O3 Benzene
## Min. : 0.000 Min. : 0.01 Min. : 0.01 Min. : 0.000
## 1st Qu.: 0.600 1st Qu.: 5.09 1st Qu.: 18.92 1st Qu.: 0.260
## Median : 0.950 Median : 9.06 Median : 30.89 Median : 1.380
## Mean : 1.616 Mean : 12.21 Mean : 38.32 Mean : 3.567
## 3rd Qu.: 1.470 3rd Qu.: 14.98 3rd Qu.: 47.45 3rd Qu.: 3.800
## Max. :175.810 Max. :186.08 Max. :963.00 Max. :455.030
## NA's :2896 NA's :9533 NA's :9598 NA's :19787
## Toluene Xylene AQI AQI_Bucket
## Min. : 0.00 Min. : 0.00 Min. : 8.0 Good : 5510
## 1st Qu.: 1.10 1st Qu.: 0.05 1st Qu.: 86.0 Moderate :29417
## Median : 5.31 Median : 0.60 Median : 132.0 Poor :11493
## Mean : 16.56 Mean : 2.77 Mean : 179.7 Satisfactory:23636
## 3rd Qu.: 19.32 3rd Qu.: 2.66 3rd Qu.: 254.0 Severe : 5207
## Max. :454.85 Max. :170.37 Max. :2049.0 Very Poor :11762
## NA's :26324 NA's :67584
head(station_day)
tail(station_day)
findCity<-function(x)
{
ifelse(x %in% aq_delhi_stations$StationId,"Delhi",
ifelse(x %in% aq_mumbai_stations$StationId,"Mumbai",
ifelse(x %in% aq_chennai_stations$StationId,"Chennai",
ifelse(x %in% aq_bengaluru_stations$StationId,"Bengaluru","Other"))))
}
station_day_city<-station_day%>%mutate(City=findCity(StationId))
#station_day_city
major_city_station_day<-station_day_city%>%filter(City %in% c("Chennai", "Bengaluru", "Delhi","Mumbai"))
#View(major_city_station_day)
st_na_total_per <-data.frame(colname = names(major_city_station_day),Total_NAs = colSums(is.na(major_city_station_day)), "PerctOfNAs" = colSums((is.na(major_city_station_day)/sum(is.na(major_city_station_day)) * 100)))
#st_na_total_per
ggplot(st_na_total_per, aes(x=colname, y=PerctOfNAs))+geom_col()
#major_city_station_day<-major_city_station_day[complete.cases(major_city_station_day),]
#major_city_station_day
major_city_station_day<-major_city_station_day%>%mutate(Latitude = case_when(City == "Chennai" ~ chn_geo$Latitude, City == "Bengaluru" ~ blr_geo$Latitude, City == "Mumbai" ~ mum_geo$Latitude,City == "Delhi" ~ dlhi_geo$Latitude),Longitude = case_when(City == "Chennai" ~ chn_geo$longitude, City == "Bengaluru" ~ blr_geo$longitude, City == "Mumbai" ~ mum_geo$longitude,City == "Delhi" ~ dlhi_geo$longitude),Elevation = case_when(City == "Chennai" ~ chn_geo$Elevation, City == "Bengaluru" ~ blr_geo$Elevation, City == "Mumbai" ~ mum_geo$Elevation,City == "Delhi" ~ dlhi_geo$Elevation))
#major_city_station_day
station_day data set is cleaned for missing values. Compared to City_day dataset station dataset is better and has less NA values for major pollutants. Unwanted pollutants can be removed from combined data set.
###Performing benfordness check for station_day data set
major_city_station_day_first_digit <-major_city_station_day %>% mutate(across(seq(3:17), firstDigit))#%>%subset(select=-c(1,2,12))
major_city_station_day_first_digit
PM2.5FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$PM2.5))
PM2.5first_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = PM2.5FD_Counts_stday,
actual.fraction = PM2.5FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
PM2.5first_digit_actual_vs_expected_stday<-PM2.5first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
PM2.5first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(PM2.5first_digit_actual_vs_expected_stday$difference.fraction), "PM2.5")
## [1] "PM2.5 Dataset has Acceptable confirmity with Mean abosolute Deviation 0.0083"
AQIFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$AQI))
AQIfirst_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = AQIFD_Counts_stday,
actual.fraction = AQIFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
AQIfirst_digit_actual_vs_expected_stday<-AQIfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
AQIfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(AQIfirst_digit_actual_vs_expected_stday$difference.fraction), "AQI")
## [1] "AQI Dataset has no confirmity with Mean abosolute Deviation 0.019"
NOFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NO))
NOfirst_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = NOFD_Counts_stday,
actual.fraction = NOFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOfirst_digit_actual_vs_expected_stday<-NOfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NOfirst_digit_actual_vs_expected_stday$difference.fraction), "NO")
## [1] "NO Dataset has Close confirmity with Mean abosolute Deviation 0.0058"
NO2FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NO2))
NO2first_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = NO2FD_Counts_stday,
actual.fraction = NO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NO2first_digit_actual_vs_expected_stday<-NO2first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NO2first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NO2first_digit_actual_vs_expected_stday$difference.fraction), "NO2")
## [1] "NO2 Dataset has no confirmity with Mean abosolute Deviation 0.016"
COFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$CO))
COfirst_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = COFD_Counts_stday,
actual.fraction = COFD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
COfirst_digit_actual_vs_expected_stday<-COfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
COfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(COfirst_digit_actual_vs_expected_stday$difference.fraction), "CO")
## [1] "CO Dataset has no confirmity with Mean abosolute Deviation 0.035"
SO2FD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$SO2))
SO2first_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = SO2FD_Counts_stday,
actual.fraction = SO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
SO2first_digit_actual_vs_expected_stday<-NO2first_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
SO2first_digit_actual_vs_expected_stday
BenfordConfirmity(mean(SO2first_digit_actual_vs_expected_stday$difference.fraction), "SO2")
## [1] "SO2 Dataset has no confirmity with Mean abosolute Deviation 0.016"
NOXFD_Counts_stday<-as.vector(table(major_city_station_day_first_digit$NOX))
NOXfirst_digit_actual_vs_expected_stday <- data.frame(
digit = 1:9,
actual.count = SO2FD_Counts_stday,
actual.fraction = SO2FD_Counts_stday / nrow(major_city_station_day_first_digit),
benford.fraction = log10(1 + 1 / (1:9))
)
NOXfirst_digit_actual_vs_expected_stday<-NOXfirst_digit_actual_vs_expected_stday %>%mutate(difference.fraction=abs(benford.fraction-actual.fraction))
NOXfirst_digit_actual_vs_expected_stday
BenfordConfirmity(mean(NOXfirst_digit_actual_vs_expected_stday$difference.fraction), "NOX")
## [1] "NOX Dataset has no confirmity with Mean abosolute Deviation 0.017"
“PM2.5 Dataset has Acceptable confirmity with Mean abosolute Deviation 0.0083” has accepatable confirmity and “NO Dataset has Close confirmity with Mean abosolute Deviation 0.0058” Rest has no conformity with minor variation. [1] “AQI Dataset has no confirmity with Mean abosolute Deviation 0.019” [1] “NO2 Dataset has no confirmity with Mean abosolute Deviation 0.016” [1] “CO Dataset has no confirmity with Mean abosolute Deviation 0.035” [1] “SO2 Dataset has no confirmity with Mean abosolute Deviation 0.016” [1] “NOX Dataset has no confirmity with Mean abosolute Deviation 0.017”
head(blr_wthr)
summary(blr_wthr)
## time tavg tmin tmax
## Length:11894 Min. :17.20 Min. : 9.30 Min. :19.80
## Class :character 1st Qu.:22.30 1st Qu.:18.10 1st Qu.:27.90
## Mode :character Median :23.50 Median :19.80 Median :29.50
## Mean :23.84 Mean :19.39 Mean :29.93
## 3rd Qu.:25.20 3rd Qu.:20.80 3rd Qu.:32.00
## Max. :32.40 Max. :27.90 Max. :39.20
## NA's :70 NA's :1389 NA's :629
## prcp
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4.414
## 3rd Qu.: 2.000
## Max. :271.300
## NA's :4620
head(chn_wthr)
summary(chn_wthr)
## time tavg tmin tmax
## Length:11894 Min. :20.90 Min. :12.00 Min. :23.80
## Class :character 1st Qu.:26.30 1st Qu.:22.60 1st Qu.:31.10
## Mode :character Median :28.70 Median :24.60 Median :34.00
## Mean :28.49 Mean :24.38 Mean :33.91
## 3rd Qu.:30.40 3rd Qu.:26.40 3rd Qu.:36.20
## Max. :36.60 Max. :31.00 Max. :44.60
## NA's :27 NA's :3084 NA's :1019
## prcp
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 6.244
## 3rd Qu.: 3.000
## Max. :344.900
## NA's :4886
head(dlhi_wthr)
summary(dlhi_wthr)
## time tavg tmin tmax
## Length:11894 Min. : 6.6 Min. : 0.10 Min. : 9.80
## Class :character 1st Qu.:18.5 1st Qu.:11.80 1st Qu.:26.70
## Mode :character Median :27.0 Median :20.00 Median :33.20
## Mean :25.0 Mean :18.88 Mean :31.79
## 3rd Qu.:30.9 3rd Qu.:26.00 3rd Qu.:36.60
## Max. :39.8 Max. :34.20 Max. :48.10
## NA's :94 NA's :1536 NA's :533
## prcp
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 3.662
## 3rd Qu.: 0.500
## Max. :262.900
## NA's :6140
head(lucknw_wthr)
summary(lucknw_wthr)
## time tavg tmin tmax
## Length:11894 Min. : 5.70 Min. :-0.6 Min. :11.10
## Class :character 1st Qu.:19.50 1st Qu.:12.5 1st Qu.:28.10
## Mode :character Median :27.20 Median :20.5 Median :33.40
## Mean :25.22 Mean :18.8 Mean :32.49
## 3rd Qu.:30.40 3rd Qu.:25.1 3rd Qu.:36.50
## Max. :39.70 Max. :32.7 Max. :47.30
## NA's :138 NA's :3515 NA's :1553
## prcp
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4.536
## 3rd Qu.: 1.000
## Max. :470.900
## NA's :6152
head(mum_wthr)
summary(mum_wthr)
## time tavg tmin tmax
## Length:11894 Min. :17.70 Min. : 8.50 Min. :22.30
## Class :character 1st Qu.:26.60 1st Qu.:19.80 1st Qu.:30.90
## Mode :character Median :28.10 Median :23.70 Median :32.40
## Mean :27.76 Mean :22.62 Mean :32.31
## 3rd Qu.:29.30 3rd Qu.:25.40 3rd Qu.:33.90
## Max. :33.70 Max. :30.40 Max. :41.30
## NA's :11 NA's :2454 NA's :1907
## prcp
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 10.94
## 3rd Qu.: 7.10
## Max. :461.00
## NA's :4681
head(geo_data)
head(bhu_wthr)
summary(bhu_wthr)
## time tavg tmin tmax
## Length:11935 Min. :15.70 Min. : 8.20 Min. :19.4
## Class :character 1st Qu.:24.70 1st Qu.:19.00 1st Qu.:30.4
## Mode :character Median :27.70 Median :24.00 Median :32.8
## Mean :26.99 Mean :22.24 Mean :33.0
## 3rd Qu.:29.40 3rd Qu.:25.60 3rd Qu.:35.4
## Max. :37.40 Max. :31.80 Max. :46.7
## NA's :78 NA's :2090 NA's :891
## prcp snow wdir wspd
## Min. : 0.000 Mode:logical Min. : 0.0 Min. : 0.500
## 1st Qu.: 0.000 NA's:11935 1st Qu.: 89.0 1st Qu.: 4.500
## Median : 0.000 Median :188.0 Median : 7.000
## Mean : 7.074 Mean :169.1 Mean : 8.399
## 3rd Qu.: 4.100 3rd Qu.:220.8 3rd Qu.:11.000
## Max. :470.900 Max. :359.0 Max. :33.100
## NA's :5097 NA's :10641 NA's :9806
## wpgt pres tsun
## Mode:logical Min. : 990.6 Mode:logical
## NA's:11935 1st Qu.:1002.9 NA's:11935
## Median :1007.3
## Mean :1007.4
## 3rd Qu.:1012.4
## Max. :1019.3
## NA's :10692
head(rourkela_wthr)
summary(rourkela_wthr)
## time tavg tmin tmax
## Length:426 Min. :14.60 Min. : 8.20 Min. :21.50
## Class :character 1st Qu.:24.40 1st Qu.:18.18 1st Qu.:29.60
## Mode :character Median :28.10 Median :25.20 Median :32.10
## Mean :26.71 Mean :22.30 Mean :32.25
## 3rd Qu.:29.30 3rd Qu.:26.10 3rd Qu.:33.80
## Max. :35.00 Max. :29.30 Max. :43.60
## NA's :2 NA's :2 NA's :2
## prcp snow wdir wspd
## Min. : 0.000 Mode:logical Min. : 0.0 Min. : 2.900
## 1st Qu.: 0.000 NA's:426 1st Qu.: 49.0 1st Qu.: 5.500
## Median : 0.200 Median :168.0 Median : 6.600
## Mean : 5.695 Mean :140.3 Mean : 7.441
## 3rd Qu.: 7.200 3rd Qu.:195.2 3rd Qu.: 8.725
## Max. :123.000 Max. :359.0 Max. :20.400
## NA's :3 NA's :2 NA's :2
## wpgt pres tsun
## Mode:logical Min. : 993.1 Mode:logical
## NA's:426 1st Qu.:1002.5 NA's:426
## Median :1005.5
## Mean :1006.8
## 3rd Qu.:1012.1
## Max. :1020.6
## NA's :2
head(rajas_wthr)
summary(rajas_wthr)
## time tavg tmin tmax
## Length:11894 Min. :17.20 Min. : 9.30 Min. :19.80
## Class :character 1st Qu.:22.30 1st Qu.:18.10 1st Qu.:27.90
## Mode :character Median :23.50 Median :19.80 Median :29.50
## Mean :23.84 Mean :19.39 Mean :29.93
## 3rd Qu.:25.20 3rd Qu.:20.80 3rd Qu.:32.00
## Max. :32.40 Max. :27.90 Max. :39.20
## NA's :70 NA's :1389 NA's :629
## prcp
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4.414
## 3rd Qu.: 2.000
## Max. :271.300
## NA's :4620
bhu_wthr_clnd<-bhu_wthr[complete.cases(bhu_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Bhuvaneswar", Latitude=bhu_geo$Latitude, Longitude=bhu_geo$longitude, Elevation=bhu_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(bhu_wthr_clnd)
rourkela_wthr_clnd<-rourkela_wthr[complete.cases(rourkela_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Rourkela", Latitude=rourkela_geo$Latitude, Longitude=rourkela_geo$longitude, Elevation=rourkela_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(rourkela_wthr_clnd)
rajas_wthr_clnd<-rajas_wthr[complete.cases(rajas_wthr),]%>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Rajasthan", Latitude=rajas_geo$Latitude, Longitude=rajas_geo$longitude, Elevation=rajas_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(rajas_wthr_clnd)
rajas_wthr_clnd_seasn<-rajas_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City, Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Year'. You can override using the
## `.groups` argument.
#rajas_wthr_clnd_seasn
lucknw_wthr_clnd<-lucknw_wthr[complete.cases(lucknw_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Lucknow", Latitude=lucknw_geo$Latitude, Longitude=lucknw_geo$longitude, Elevation=lucknw_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(lucknw_wthr_clnd)
lucknw_wthr_clnd_seasn<-lucknw_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City, Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Year'. You can override using the
## `.groups` argument.
#lucknw_wthr_clnd_seasn
blr_wthr_clnd<-blr_wthr[complete.cases(blr_wthr),] %>% filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Bengaluru", Latitude=blr_geo$Latitude, Longitude=blr_geo$longitude, Elevation=blr_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(blr_wthr_clnd)
blr_wthr_clnd_seasn<-blr_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#blr_wthr_clnd_seasn
names(blr_wthr_clnd_seasn)
## [1] "City" "Latitude" "Longitude" "Elevation" "Year" "Season"
## [7] "tavg" "tmin" "tmax" "prcp"
ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tmax, color=blr_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tavg, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$tmin, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(blr_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=blr_wthr_clnd_seasn$prcp, color=blr_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
chn_wthr_clnd<-chn_wthr[complete.cases(chn_wthr),] %>% filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Chennai", Latitude=chn_geo$Latitude, Longitude=chn_geo$longitude, Elevation=chn_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(chn_wthr_clnd)
chn_wthr_clnd_seasn<-chn_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#chn_wthr_clnd_seasn
ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tmax, color=chn_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tavg, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$tmin, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(chn_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=chn_wthr_clnd_seasn$prcp, color=chn_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
dlhi_wthr_clnd<-dlhi_wthr[complete.cases(dlhi_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Delhi", Latitude=dlhi_geo$Latitude, Longitude=dlhi_geo$longitude, Elevation=dlhi_geo$Elevation, Season=season(ymd(as.Date(time))))
#View(dlhi_wthr_clnd)
dlhi_wthr_clnd_seasn<-dlhi_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#dlhi_wthr_clnd_seasn
ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tmax, color=dlhi_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tavg, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$tmin, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(dlhi_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=dlhi_wthr_clnd_seasn$prcp, color=dlhi_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
mum_wthr_clnd<-mum_wthr[complete.cases(mum_wthr),] %>%filter(between(as.Date(time, format = "%d-%m-%Y"), as.Date('31-12-2014', format = "%d-%m-%Y"), as.Date('01-01-2021', format = "%d-%m-%Y")))%>%mutate(City="Mumbai", Latitude=mum_geo$Latitude, Longitude=mum_geo$longitude, Elevation=mum_geo$Elevation, Season=season(ymd(as.Date(time))))
mum_wthr_clnd_seasn<-mum_wthr_clnd %>% mutate(Year = lubridate::year(as.Date(time, format = "%d-%m-%Y"))) %>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(2:5), mean, na.rm = TRUE))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#mum_wthr_clnd_seasn
ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tmax, color=mum_wthr_clnd_seasn$Season)) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tavg, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$tmin, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
ggplot(mum_wthr_clnd_seasn, aes(x=Year))+ geom_line(aes(y=mum_wthr_clnd_seasn$prcp, color=mum_wthr_clnd_seasn$Season))+ scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
Pattern of tmin, tmax, tavg, and prcp has changed over years.
Chennai - tavg can be higher during summer or monsoon, precipitation is higher summer and prewinter.
Bangalore -tavg mostly ver high during summer but lower than chennai, precipitation is high during autumn or summer
Mumbai-tavg mostly ver high during summer, precipitation is high during monsoon and autumn
Delhi- tavg can be higher during summer or monsoon, precipitation is high during summer, prewinter and monsoon. It has changed over years.
major_city_station_day_seasonwise<-major_city_station_day%>%mutate(Year=lubridate::year(as.Date(Date)), Season=season(as.Date(Date)))%>%group_by(City,Latitude,Longitude, Elevation,Year, Season)%>% summarise(across(c(2:16), mean, na.rm = TRUE))%>%select(-c(4,18))
## Warning: There were 232 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c(2:16), mean, na.rm = TRUE)`.
## ℹ In group 1: `City = "Bengaluru"`, `Latitude = 12.97`, `Longitude = 77.59`,
## `Elevation = 920`, `Year = 2015`, `Season = "Autumn"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 231 remaining warnings.
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
## Adding missing grouping variables: `Elevation`
#major_city_station_day_seasonwise
list_city_wthr<-list(blr_wthr_clnd_seasn, chn_wthr_clnd_seasn, dlhi_wthr_clnd_seasn, mum_wthr_clnd_seasn)
major_city_wthr<-Reduce(function(x,y) merge(x, y, all=TRUE), list_city_wthr)
#major_city_wthr
major_city_AQ_wthr_pos<-merge(major_city_station_day_seasonwise, major_city_wthr, all=TRUE)
summary(major_city_AQ_wthr_pos)
## Elevation City Latitude Longitude
## Min. : 6 Length:134 Min. :12.97 Min. :72.85
## 1st Qu.: 6 Class :character 1st Qu.:12.97 1st Qu.:77.20
## Median :211 Mode :character Median :13.07 Median :77.59
## Mean :307 Mean :18.38 Mean :77.28
## 3rd Qu.:920 3rd Qu.:28.58 3rd Qu.:80.25
## Max. :920 Max. :28.58 Max. :80.25
##
## Year Season Date PM2.5
## Min. :2015 Length:134 Min. : NA Min. : 10.44
## 1st Qu.:2016 Class :character 1st Qu.: NA 1st Qu.: 39.76
## Median :2018 Mode :character Median : NA Median : 55.77
## Mean :2018 Mean :NaN Mean : 74.47
## 3rd Qu.:2019 3rd Qu.: NA 3rd Qu.: 82.84
## Max. :2021 Max. : NA Max. :263.47
## NA's :134 NA's :18
## PM10 NO NO2 NOx
## Min. : 34.84 Min. : 3.580 Min. : 8.771 Min. : 8.84
## 1st Qu.: 77.48 1st Qu.: 8.567 1st Qu.: 17.060 1st Qu.: 19.40
## Median :122.21 Median : 13.289 Median : 30.183 Median : 32.14
## Mean :159.14 Mean : 23.211 Mean : 33.550 Mean : 44.83
## 3rd Qu.:226.52 3rd Qu.: 29.830 3rd Qu.: 47.472 3rd Qu.: 66.37
## Max. :490.09 Max. :111.990 Max. :100.848 Max. :163.85
## NA's :45 NA's :18 NA's :18 NA's :21
## NH3 CO SO2 O3
## Min. : 6.772 Min. :0.3914 Min. : 2.861 Min. : 8.347
## 1st Qu.: 20.937 1st Qu.:0.9369 1st Qu.: 5.887 1st Qu.: 28.562
## Median : 34.818 Median :1.2465 Median : 9.750 Median : 36.013
## Mean : 45.507 Mean :1.7999 Mean :10.867 Mean : 40.403
## 3rd Qu.: 53.762 3rd Qu.:1.8860 3rd Qu.:14.126 3rd Qu.: 45.694
## Max. :225.576 Max. :9.9567 Max. :36.554 Max. :111.430
## NA's :29 NA's :18 NA's :18 NA's :18
## Benzene Xylene AQI AQI_Bucket
## Min. : 0.02893 Min. :0.0000 Min. : 55.14 Min. : NA
## 1st Qu.: 0.71293 1st Qu.:0.4828 1st Qu.: 97.99 1st Qu.: NA
## Median : 1.71125 Median :1.1136 Median :124.06 Median : NA
## Mean : 3.01885 Mean :1.6075 Mean :164.64 Mean :NaN
## 3rd Qu.: 4.22737 3rd Qu.:2.3377 3rd Qu.:193.03 3rd Qu.: NA
## Max. :21.55448 Max. :5.3271 Max. :457.61 Max. : NA
## NA's :19 NA's :107 NA's :18 NA's :134
## tavg tmin tmax prcp
## Min. :12.67 Min. : 8.533 Min. :19.09 Min. : 0.000
## 1st Qu.:22.56 1st Qu.:18.683 1st Qu.:28.39 1st Qu.: 1.060
## Median :26.37 Median :21.792 Median :30.52 Median : 4.703
## Mean :25.25 Mean :20.916 Mean :30.46 Mean : 6.983
## 3rd Qu.:28.65 3rd Qu.:24.303 3rd Qu.:34.02 3rd Qu.: 9.111
## Max. :31.90 Max. :28.950 Max. :39.00 Max. :49.060
## NA's :18 NA's :18 NA's :18 NA's :18
#major_city_AQ_wthr_pos
major_city_AQ_wthr_pos_clnd<-major_city_AQ_wthr_pos[,-c(7,9,13,17,18,20)]
#major_city_AQ_wthr_pos_clnd
City_AQ_WTHR_POS<-major_city_AQ_wthr_pos_clnd
City_AQ_WTHR_POS<-City_AQ_WTHR_POS[!is.na(City_AQ_WTHR_POS$PM2.5),]
#City_AQ_WTHR_POS
ggplot(City_AQ_WTHR_POS, aes(x=Year, y=PM2.5, color=Season)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) + scale_x_continuous(name ="year", breaks = seq(2015, 2020, 1))
Checked if there PM2.5 outliers in the dataset. There are no outliers.
cityMap <- City_AQ_WTHR_POS %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(lat = City_AQ_WTHR_POS$Latitude, lng = City_AQ_WTHR_POS$Longitude, weight = 5, color = 'midnightblue', radius = sqrt(City_AQ_WTHR_POS$AQI)) %>%
addPopups(lat = City_AQ_WTHR_POS$Latitude, lng = City_AQ_WTHR_POS$Longitude, popup = paste("PM2.5 ", as.character(round(City_AQ_WTHR_POS$PM2.5, 2)),"\n", "AQI ", as.character(round(City_AQ_WTHR_POS$AQI, 2))))
cityMap
NO, NO2, NOx are considered for analysis as the source of these pollutants and PM2.5 are similar and these pollutants may serve as precursor for PM2.5. Just to limit the scope of analysis only these pollutants and AQI are considered for modeling. To start with their impacts on PM2.5 are studied.
ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NO)) +
geom_point(na.rm=TRUE) +
labs(title = "Impact of AQI, Season and NO on PM2.5")
Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NO. Monsoon are concentration of PM2.5 and NO are lower with low AQI.
ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NO2)) +
geom_point(na.rm=TRUE) +
labs(title = "Impact of AQI, Season and NO2 on PM2.5")
Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NO2. Monsoon are concentration of PM2.5 and NO2 are lower with low AQI.Summer and spring are moderate
ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = NOx)) +
geom_point(na.rm=TRUE) +
labs(title = "Impact of AQI, Season and NOx on PM2.5")
Overall PM2.5 is higher in the order of Autumn, Prewinter and winter with higher AQI and NOx. Monsoon are concentration of PM2.5 and NOx are lower with low AQI.Summer and spring are moderate
ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = AQI, size = Elevation)) +
geom_point(na.rm=TRUE) +
labs(title = "Impact of AQI, Season and Elevation on PM2.5")
Higher the elevation PM2.5 is low with low AQI. PM2.5 is generally high during autumns, prewinter and winter. moderate during summer and spring. Monsoons low PM2.5.
ggplot(City_AQ_WTHR_POS, aes(x = Season, y = PM2.5, color = tavg , size = prcp, na.rm=TRUE)) +
geom_point(na.rm=TRUE) +
labs(title = "Impact of Average Temperature, Season and Precipitation on PM2.5")
Higher the temperature and high precipitation leads to low PM2.5.
#city_hour_sepdt
city_hour_test_data<-city_hour_sepdt%>%mutate(Latitude = case_when(City == "Chennai" ~ chn_geo$Latitude, City == "Bengaluru" ~ blr_geo$Latitude, City == "Mumbai" ~ mum_geo$Latitude,City == "Delhi" ~ dlhi_geo$Latitude),Longitude = case_when(City == "Chennai" ~ chn_geo$longitude, City == "Bengaluru" ~ blr_geo$longitude, City == "Mumbai" ~ mum_geo$longitude,City == "Delhi" ~ dlhi_geo$longitude),Elevation = case_when(City == "Chennai" ~ chn_geo$Elevation, City == "Bengaluru" ~ blr_geo$Elevation, City == "Mumbai" ~ mum_geo$Elevation,City == "Delhi" ~ dlhi_geo$Elevation))
major_city_hour_test_data<-city_hour_test_data %>% filter(City %in% c("Chennai", "Bengaluru", "Delhi","Mumbai"))
major_city_hour_test_data
major_city_hour_test_data1<-major_city_hour_test_data%>% group_by(City,Latitude,Longitude, Elevation,Year, Season) %>% summarise(across(c(3:15), mean, na.rm = TRUE))%>%select(-c(12,16,17,18))
## `summarise()` has grouped output by 'City', 'Latitude', 'Longitude',
## 'Elevation', 'Year'. You can override using the `.groups` argument.
#major_city_hour_test_data1
major_city_hour_test_data_temp<-merge(major_city_hour_test_data1, major_city_wthr, all=TRUE)
#major_city_hour_test_data_temp
major_city_hour_test_data_final<-major_city_hour_test_data_temp%>%select(-c(2,3,8,12,13,14,17,18))
#major_city_hour_test_data_final
head(major_city_hour_test_data_final)
11 different with lm and rpart(22 models) separately trained with different explanatory variables. Combined data set of station_day, weather and geo data of Chennai, Bengaluru, Delhi and mumbai is used for training. Combined data with city_day, weather and geo is used as test dataset.
LM Model1: Response Variable: PM2.5 Explanatory Variable(s): NO LM Model2: Response Variable: PM2.5 Explanatory Variable(s): NO + Season LM Model3: Response Variable: PM2.5 Explanatory Variable(s): NO2 + Season LM Model4: Response Variable: PM2.5 Explanatory Variable(s): NOx + Season LM Model5: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season LM Model6: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg LM Model7: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp LM Model8: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + Elevation LM Model9: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation LM Model10: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI LM Model11: Response Variable: PM2.5 Explanatory Variable(s): AQI
RPART Model1: Response Variable: PM2.5 Explanatory Variable(s): NO RPART Model2: Response Variable: PM2.5 Explanatory Variable(s): NO + Season RPART Model3: Response Variable: PM2.5 Explanatory Variable(s): NO2 + Season RPART Model4: Response Variable: PM2.5 Explanatory Variable(s): NOx + Season RPART Model5: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season RPART Model6: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg RPART Model7: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp RPART Model8: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + Elevation RPART Model9: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation RPART Model10: Response Variable: PM2.5 Explanatory Variable(s): NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI RPART Model11: Response Variable: PM2.5 Explanatory Variable(s): AQI
lmmodel1_NO<-lm(PM2.5~NO, City_AQ_WTHR_POS)
fmodel(lmmodel1_NO)
rpartmode11_NO<-rpart(PM2.5~NO, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode11_NO)
prp(rpartmode11_NO)
lmmodel2_NOSea<-lm(PM2.5~NO+Season, City_AQ_WTHR_POS)
fmodel(lmmodel2_NOSea)
rpartmode12_NOSea<-rpart(PM2.5~NO+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode12_NOSea)
prp(rpartmode12_NOSea)
lmmodel3_NO2Sea<-lm(PM2.5~NO2+Season, City_AQ_WTHR_POS)
fmodel(lmmodel3_NO2Sea)
rpartmode13_NO2Sea<-rpart(PM2.5~NO2+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode13_NO2Sea)
prp(rpartmode13_NO2Sea)
lmmodel4_NOXSea<-lm(PM2.5~NOx+Season, City_AQ_WTHR_POS)
fmodel(lmmodel4_NOXSea)
rpartmode14_NOXSea<-rpart(PM2.5~NOx+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode14_NOXSea)
prp(rpartmode14_NOXSea)
lmmodel5_NOCombSea<-lm(PM2.5~NO+NO2+NOx+Season, City_AQ_WTHR_POS)
fmodel(lmmodel5_NOCombSea)
rpartmode15_NOCombSea<-rpart(PM2.5~NO+NO2+NOx+Season, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode15_NOCombSea)
prp(rpartmode15_NOCombSea)
lmmodel6_NOTempCombSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg, City_AQ_WTHR_POS)
fmodel(lmmodel6_NOTempCombSea)
rpartmode16_NOTempCombSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmode16_NOTempCombSea)
prp(rpartmode16_NOTempCombSea)
lmmodel7_NOTempCombPrcPSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp, City_AQ_WTHR_POS)
fmodel(lmmodel7_NOTempCombPrcPSea)
rpartmodel7_NOTempCombPrcPSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel7_NOTempCombPrcPSea)
prp(rpartmodel7_NOTempCombPrcPSea)
lmmodel8_NOCombElevPSea<-lm(PM2.5~NO+NO2+NOx+Season+Elevation, City_AQ_WTHR_POS)
fmodel(lmmodel8_NOCombElevPSea)
rpartmodel8_NOCombElevPSea<-rpart(PM2.5~NO+NO2+NOx+NOx+Season+Elevation, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel8_NOCombElevPSea)
prp(rpartmodel8_NOCombElevPSea)
lmmodel9_NOTempCombElevPrcpSea<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation, City_AQ_WTHR_POS)
fmodel(lmmodel9_NOTempCombElevPrcpSea)
rpartmodel9_NOTempCombElevPrcpSea<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel9_NOTempCombElevPrcpSea)
prp(rpartmodel9_NOTempCombElevPrcpSea)
lmmodel10_NOTempCombElevPrcpSeaAQI<-lm(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation+AQI, City_AQ_WTHR_POS)
fmodel(lmmodel10_NOTempCombElevPrcpSeaAQI)
rpartmodel10_NOTempCombElevPrcpSeaAQI<-rpart(PM2.5~NO+NO2+NOx+Season+tavg+prcp+Elevation+AQI, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel10_NOTempCombElevPrcpSeaAQI)
prp(rpartmodel10_NOTempCombElevPrcpSeaAQI)
lmmodel11_AQI<-lm(PM2.5~AQI, City_AQ_WTHR_POS)
fmodel(lmmodel11_AQI)
rpartmodel11_AQI<-rpart(PM2.5~AQI, data=City_AQ_WTHR_POS, cp=0.01)
fmodel(rpartmodel11_AQI)
prp(rpartmodel11_AQI)
lmmodel1_output <- data.frame(evaluate_model(lmmodel1_NO, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel1_output<-lmmodel1_output %>% mutate(diff =(lmmodel1_output$PM2.5 - lmmodel1_output$model_output))
lmmodel1_MSE<-mean((lmmodel1_output$diff)^2, na.rm=TRUE)
rpartmodel1_output <- evaluate_model(rpartmode11_NO, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel1_output<-rpartmodel1_output%>%mutate(diff =(rpartmodel1_output$PM2.5-rpartmodel1_output$model_output))
rpartmodel1_MSE<-mean((rpartmodel1_output$diff)^2, na.rm=TRUE)
lmmodel2_output <- data.frame(evaluate_model(lmmodel2_NOSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel2_output<-lmmodel2_output %>% mutate(diff =(lmmodel2_output$PM2.5 - lmmodel2_output$model_output))
lmmodel2_MSE<-mean((lmmodel2_output$diff)^2, na.rm=TRUE)
rpartmodel2_output <- evaluate_model(rpartmode12_NOSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel2_output<-rpartmodel2_output%>%mutate(diff =(rpartmodel2_output$PM2.5-rpartmodel2_output$model_output))
rpartmodel2_MSE<-mean((rpartmodel2_output$diff)^2, na.rm=TRUE)
lmmodel3_output <- data.frame(evaluate_model(lmmodel3_NO2Sea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel3_output<-lmmodel3_output %>% mutate(diff =(lmmodel3_output$PM2.5 - lmmodel3_output$model_output))
lmmodel3_MSE<-mean((lmmodel3_output$diff)^2, na.rm=TRUE)
rpartmodel3_output <- evaluate_model(rpartmode13_NO2Sea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel3_output<-rpartmodel3_output%>%mutate(diff =(rpartmodel3_output$PM2.5-rpartmodel3_output$model_output))
rpartmodel3_MSE<-mean((rpartmodel3_output$diff)^2, na.rm=TRUE)
lmmodel4_output <- data.frame(evaluate_model(lmmodel4_NOXSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel4_output<-lmmodel4_output %>% mutate(diff =(lmmodel4_output$PM2.5 - lmmodel4_output$model_output))
lmmodel4_MSE<-mean((lmmodel4_output$diff)^2, na.rm=TRUE)
rpartmodel4_output <- evaluate_model(rpartmode14_NOXSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel4_output<-rpartmodel4_output%>%mutate(diff =(rpartmodel4_output$PM2.5-rpartmodel4_output$model_output))
rpartmodel4_MSE<-mean((rpartmodel4_output$diff)^2, na.rm=TRUE)
lmmodel5_output <- data.frame(evaluate_model(lmmodel5_NOCombSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel5_output<-lmmodel5_output %>% mutate(diff =(lmmodel5_output$PM2.5 - lmmodel5_output$model_output))
lmmodel5_MSE<-mean((lmmodel5_output$diff)^2, na.rm=TRUE)
rpartmodel5_output <- evaluate_model(rpartmode15_NOCombSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel5_output<-rpartmodel5_output%>%mutate(diff =(rpartmodel5_output$PM2.5-rpartmodel5_output$model_output))
rpartmodel5_MSE<-mean((rpartmodel5_output$diff)^2, na.rm=TRUE)
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Season for training data set", rpartmodel5_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Season for training data set 710.426155551455"
lmmodel6_output <- data.frame(evaluate_model(lmmodel6_NOTempCombSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel6_output<-lmmodel6_output %>% mutate(diff =(lmmodel6_output$PM2.5 - lmmodel6_output$model_output))
lmmodel6_MSE<-mean((lmmodel6_output$diff)^2, na.rm=TRUE)
rpartmodel6_output <- evaluate_model(rpartmode16_NOTempCombSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel6_output<-rpartmodel6_output%>%mutate(diff =(rpartmodel6_output$PM2.5-rpartmodel6_output$model_output))
rpartmodel6_MSE<-mean((rpartmodel6_output$diff)^2, na.rm=TRUE)
lmmodel7_output <- data.frame(evaluate_model(lmmodel7_NOTempCombPrcPSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel7_output<-lmmodel7_output %>% mutate(diff =(lmmodel7_output$PM2.5 - lmmodel7_output$model_output))
lmmodel7_MSE<-mean((lmmodel7_output$diff)^2, na.rm=TRUE)
rpartmodel7_output <- evaluate_model(rpartmodel7_NOTempCombPrcPSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel7_output<-rpartmodel7_output%>%mutate(diff =(rpartmodel7_output$PM2.5-rpartmodel7_output$model_output))
rpartmodel7_MSE<-mean((rpartmodel7_output$diff)^2, na.rm=TRUE)
lmmodel8_output <- data.frame(evaluate_model(lmmodel8_NOCombElevPSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel8_output<-lmmodel8_output %>% mutate(diff =(lmmodel8_output$PM2.5 - lmmodel8_output$model_output))
lmmodel8_MSE<-mean((lmmodel8_output$diff)^2, na.rm=TRUE)
rpartmodel8_output <- evaluate_model(rpartmodel8_NOCombElevPSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel8_output<-rpartmodel8_output%>%mutate(diff =(rpartmodel8_output$PM2.5-rpartmodel8_output$model_output))
rpartmodel8_MSE<-mean((rpartmodel8_output$diff)^2, na.rm=TRUE)
lmmodel9_output <- data.frame(evaluate_model(lmmodel9_NOTempCombElevPrcpSea, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel9_output<-lmmodel9_output %>% mutate(diff =(lmmodel9_output$PM2.5 - lmmodel9_output$model_output))
lmmodel9_MSE<-mean((lmmodel9_output$diff)^2, na.rm=TRUE)
rpartmodel9_output <- evaluate_model(rpartmodel9_NOTempCombElevPrcpSea, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel9_output<-rpartmodel9_output%>%mutate(diff =(rpartmodel9_output$PM2.5-rpartmodel9_output$model_output))
rpartmodel9_MSE<-mean((rpartmodel9_output$diff)^2, na.rm=TRUE)
lmmodel10_output <- data.frame(evaluate_model(lmmodel10_NOTempCombElevPrcpSeaAQI, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel10_output<-lmmodel10_output %>% mutate(diff =(lmmodel10_output$PM2.5 - lmmodel10_output$model_output))
lmmodel10_MSE<-mean((lmmodel10_output$diff)^2, na.rm=TRUE)
rpartmodel10_output <- evaluate_model(rpartmodel10_NOTempCombElevPrcpSeaAQI, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel10_output<-rpartmodel10_output%>%mutate(diff =(rpartmodel10_output$PM2.5-rpartmodel10_output$model_output))
rpartmodel10_MSE<-mean((rpartmodel10_output$diff)^2, na.rm=TRUE)
lmmodel11_output <- data.frame(evaluate_model(lmmodel11_AQI, City_AQ_WTHR_POS))%>%select(PM2.5, model_output)
lmmodel11_output<-lmmodel11_output %>% mutate(diff =(lmmodel11_output$PM2.5 - lmmodel11_output$model_output))
lmmodel11_MSE<-mean((lmmodel11_output$diff)^2, na.rm=TRUE)
rpartmodel11_output <- evaluate_model(rpartmodel11_AQI, City_AQ_WTHR_POS)%>%select(PM2.5, model_output)
rpartmodel11_output<-rpartmodel11_output%>%mutate(diff =(rpartmodel11_output$PM2.5-rpartmodel11_output$model_output))
rpartmodel11_MSE<-mean((rpartmodel11_output$diff)^2, na.rm=TRUE)
print(paste("MSE of lm model trained with only NO for training data set", lmmodel1_MSE))
## [1] "MSE of lm model trained with only NO for training data set 1061.33499331602"
print(paste("MSE of rpart model trained with only NO for training data set", rpartmodel1_MSE))
## [1] "MSE of rpart model trained with only NO for training data set 987.1894333002"
print(paste("MSE of lm model trained with only NO + Season for training data set", lmmodel2_MSE))
## [1] "MSE of lm model trained with only NO + Season for training data set 918.333727467041"
print(paste("MSE of rpart model trained with only NO + Season for training data set", rpartmodel2_MSE))
## [1] "MSE of rpart model trained with only NO + Season for training data set 876.496881540318"
print(paste("MSE of lm model trained with only NO2 + Season for training data set", lmmodel3_MSE))
## [1] "MSE of lm model trained with only NO2 + Season for training data set 1340.46177198892"
print(paste("MSE of rpart model trained with only NO2 + Season for training data set", rpartmodel3_MSE))
## [1] "MSE of rpart model trained with only NO2 + Season for training data set 776.716479521588"
print(paste("MSE of lm model trained with only NOX + Season for training data set", lmmodel4_MSE))
## [1] "MSE of lm model trained with only NOX + Season for training data set 1412.41811010542"
print(paste("MSE of rpart model trained with only NOX + Season for training data set", rpartmodel4_MSE))
## [1] "MSE of rpart model trained with only NOX + Season for training data set 1249.88681402177"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Season for training data set", lmmodel5_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Season for training data set 802.808464115074"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for training data set", lmmodel6_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for training data set 631.673468162641"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for training data set", rpartmodel6_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for training data set 727.900589372769"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for training data set", lmmodel7_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for training data set 617.745498296348"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for training data set", rpartmodel7_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for training data set 727.900589372769"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for training data set", lmmodel8_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for training data set 720.697686496395"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for training data set", rpartmodel8_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for training data set 710.426155551455"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set", lmmodel9_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set 462.738773231392"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set", rpartmodel9_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for training data set 717.825390944258"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set", lmmodel10_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set 106.871067357215"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set", rpartmodel10_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for training data set 212.265984562307"
print(paste("MSE of lm model trained with only AQI for training data set", lmmodel11_MSE))
## [1] "MSE of lm model trained with only AQI for training data set 210.920482258775"
print(paste("MSE of rpart model trained with only AQI for training data set", rpartmodel11_MSE))
## [1] "MSE of rpart model trained with only AQI for training data set 212.265984562307"
lmmodel1_output <- data.frame(evaluate_model(lmmodel1_NO, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel1_output<-lmmodel1_output %>% mutate(diff =(lmmodel1_output$PM2.5 - lmmodel1_output$model_output))
lmmodel1_MSE<-mean((lmmodel1_output$diff)^2,na.rm=TRUE)
rpartmodel1_output <- evaluate_model(rpartmode11_NO, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel1_output<-rpartmodel1_output%>%mutate(diff =(rpartmodel1_output$PM2.5-rpartmodel1_output$model_output))
rpartmodel1_MSE<-mean((rpartmodel1_output$diff)^2,na.rm=TRUE)
lmmodel2_output <- data.frame(evaluate_model(lmmodel2_NOSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel2_output<-lmmodel2_output %>% mutate(diff =(lmmodel2_output$PM2.5 - lmmodel2_output$model_output))
lmmodel2_MSE<-mean((lmmodel2_output$diff)^2,na.rm=TRUE)
rpartmodel2_output <- evaluate_model(rpartmode12_NOSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel2_output<-rpartmodel2_output%>%mutate(diff =(rpartmodel2_output$PM2.5-rpartmodel2_output$model_output))
rpartmodel2_MSE<-mean((rpartmodel2_output$diff)^2,na.rm=TRUE)
lmmodel3_output <- data.frame(evaluate_model(lmmodel3_NO2Sea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel3_output<-lmmodel3_output %>% mutate(diff =(lmmodel3_output$PM2.5 - lmmodel3_output$model_output))
lmmodel3_MSE<-mean((lmmodel3_output$diff)^2,na.rm=TRUE)
rpartmodel3_output <- evaluate_model(rpartmode13_NO2Sea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel3_output<-rpartmodel3_output%>%mutate(diff =(rpartmodel3_output$PM2.5-rpartmodel3_output$model_output))
rpartmodel3_MSE<-mean((rpartmodel3_output$diff)^2,na.rm=TRUE)
lmmodel4_output <- data.frame(evaluate_model(lmmodel4_NOXSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel4_output<-lmmodel4_output %>% mutate(diff =(lmmodel4_output$PM2.5 - lmmodel4_output$model_output))
lmmodel4_MSE<-mean((lmmodel4_output$diff)^2,na.rm=TRUE)
rpartmodel4_output <- evaluate_model(rpartmode14_NOXSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel4_output<-rpartmodel4_output%>%mutate(diff =(rpartmodel4_output$PM2.5-rpartmodel4_output$model_output))
rpartmodel4_MSE<-mean((rpartmodel4_output$diff)^2,na.rm=TRUE)
lmmodel5_output <- data.frame(evaluate_model(lmmodel5_NOCombSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel5_output<-lmmodel5_output %>% mutate(diff =(lmmodel5_output$PM2.5 - lmmodel5_output$model_output))
lmmodel5_MSE<-mean((lmmodel5_output$diff)^2,na.rm=TRUE)
rpartmodel5_output <- evaluate_model(rpartmode15_NOCombSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel5_output<-rpartmodel5_output%>%mutate(diff =(rpartmodel5_output$PM2.5-rpartmodel5_output$model_output))
rpartmodel5_MSE<-mean((rpartmodel5_output$diff)^2,na.rm=TRUE)
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Season for test data set", rpartmodel5_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Season for test data set 770.051250699718"
lmmodel6_output <- data.frame(evaluate_model(lmmodel6_NOTempCombSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel6_output<-lmmodel6_output %>% mutate(diff =(lmmodel6_output$PM2.5 - lmmodel6_output$model_output))
lmmodel6_MSE<-mean((lmmodel6_output$diff)^2,na.rm=TRUE)
rpartmodel6_output <- evaluate_model(rpartmode16_NOTempCombSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel6_output<-rpartmodel6_output%>%mutate(diff =(rpartmodel6_output$PM2.5-rpartmodel6_output$model_output))
rpartmodel6_MSE<-mean((rpartmodel6_output$diff)^2,na.rm=TRUE)
lmmodel7_output <- data.frame(evaluate_model(lmmodel7_NOTempCombPrcPSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel7_output<-lmmodel7_output %>% mutate(diff =(lmmodel7_output$PM2.5 - lmmodel7_output$model_output))
lmmodel7_MSE<-mean((lmmodel7_output$diff)^2,na.rm=TRUE)
rpartmodel7_output <- evaluate_model(rpartmodel7_NOTempCombPrcPSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel7_output<-rpartmodel7_output%>%mutate(diff =(rpartmodel7_output$PM2.5-rpartmodel7_output$model_output))
rpartmodel7_MSE<-mean((rpartmodel7_output$diff)^2,na.rm=TRUE)
lmmodel8_output <- data.frame(evaluate_model(lmmodel8_NOCombElevPSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel8_output<-lmmodel8_output %>% mutate(diff =(lmmodel8_output$PM2.5 - lmmodel8_output$model_output))
lmmodel8_MSE<-mean((lmmodel8_output$diff)^2,na.rm=TRUE)
rpartmodel8_output <- evaluate_model(rpartmodel8_NOCombElevPSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel8_output<-rpartmodel8_output%>%mutate(diff =(rpartmodel8_output$PM2.5-rpartmodel8_output$model_output))
rpartmodel8_MSE<-mean((rpartmodel8_output$diff)^2,na.rm=TRUE)
lmmodel9_output <- data.frame(evaluate_model(lmmodel9_NOTempCombElevPrcpSea, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel9_output<-lmmodel9_output %>% mutate(diff =(lmmodel9_output$PM2.5 - lmmodel9_output$model_output))
lmmodel9_MSE<-mean((lmmodel9_output$diff)^2,na.rm=TRUE)
rpartmodel9_output <- evaluate_model(rpartmodel9_NOTempCombElevPrcpSea, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel9_output<-rpartmodel9_output%>%mutate(diff =(rpartmodel9_output$PM2.5-rpartmodel9_output$model_output))
rpartmodel9_MSE<-mean((rpartmodel9_output$diff)^2,na.rm=TRUE)
lmmodel10_output <- data.frame(evaluate_model(lmmodel10_NOTempCombElevPrcpSeaAQI, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel10_output<-lmmodel10_output %>% mutate(diff =(lmmodel10_output$PM2.5 - lmmodel10_output$model_output))
lmmodel10_MSE<-mean((lmmodel10_output$diff)^2, na.rm=TRUE)
rpartmodel10_output <- evaluate_model(rpartmodel10_NOTempCombElevPrcpSeaAQI, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel10_output<-rpartmodel10_output%>%mutate(diff =(rpartmodel10_output$PM2.5-rpartmodel10_output$model_output))
rpartmodel10_MSE<-mean((rpartmodel10_output$diff)^2, na.rm=TRUE)
lmmodel11_output <- data.frame(evaluate_model(lmmodel11_AQI, major_city_hour_test_data_final))%>%select(PM2.5, model_output)
lmmodel11_output<-lmmodel11_output %>% mutate(diff =(lmmodel11_output$PM2.5 - lmmodel11_output$model_output))
lmmodel11_MSE<-mean((lmmodel11_output$diff)^2, na.rm=TRUE)
rpartmodel11_output <- evaluate_model(rpartmodel11_AQI, major_city_hour_test_data_final)%>%select(PM2.5, model_output)
rpartmodel11_output<-rpartmodel11_output%>%mutate(diff =(rpartmodel11_output$PM2.5-rpartmodel11_output$model_output))
rpartmodel11_MSE<-mean((rpartmodel11_output$diff)^2, na.rm=TRUE)
print(paste("MSE of lm model trained with only NO for test data set", lmmodel1_MSE))
## [1] "MSE of lm model trained with only NO for test data set 1027.64415372642"
print(paste("MSE of rpart model trained with only NO for test data set", rpartmodel1_MSE))
## [1] "MSE of rpart model trained with only NO for test data set 958.547982922875"
print(paste("MSE of lm model trained with only NO + Season for test data set", lmmodel2_MSE))
## [1] "MSE of lm model trained with only NO + Season for test data set 896.715308139762"
print(paste("MSE of rpart model trained with only NO + Season for test data set", rpartmodel2_MSE))
## [1] "MSE of rpart model trained with only NO + Season for test data set 985.077280399531"
print(paste("MSE of lm model trained with only NO2 + Season for test data set", lmmodel3_MSE))
## [1] "MSE of lm model trained with only NO2 + Season for test data set 1187.39887983887"
print(paste("MSE of rpart model trained with only NO2 + Season for test data set", rpartmodel3_MSE))
## [1] "MSE of rpart model trained with only NO2 + Season for test data set 819.395413552255"
print(paste("MSE of lm model trained with only NOX + Season for test data set", lmmodel4_MSE))
## [1] "MSE of lm model trained with only NOX + Season for test data set 1778.98277381702"
print(paste("MSE of rpart model trained with only NOX + Season for test data set", rpartmodel4_MSE))
## [1] "MSE of rpart model trained with only NOX + Season for test data set 1877.07924764811"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Season for test data set", lmmodel5_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Season for test data set 1102.21594202083"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for test data set", lmmodel6_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Season for test data set 688.910782291019"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for test data set", rpartmodel6_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Season for test data set 776.144394669279"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for test data set", lmmodel7_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for test data set 671.858921328259"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for test data set", rpartmodel7_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + tempavg + Precipitation + Season for test data set 776.144394669279"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for test data set", lmmodel8_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Elevation + Season for test data set 1129.76725458874"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for test data set", rpartmodel8_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Elevation + Season for test data set 770.051250699718"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set", lmmodel9_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set 508.743362701904"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set", rpartmodel9_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season for test data set 765.364054382867"
print(paste("MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set", lmmodel10_MSE))
## [1] "MSE of lm model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set 188.852974998694"
print(paste("MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set", rpartmodel10_MSE))
## [1] "MSE of rpart model trained with only NO + NO2 + NOx + Temp avg + Precipitation + Elevation + Season + AQI for test data set 247.40355272891"
print(paste("MSE of lm model trained with only AQI for test data set", lmmodel11_MSE))
## [1] "MSE of lm model trained with only AQI for test data set 205.748333922921"
print(paste("MSE of rpart model trained with only AQI for test data set", rpartmodel11_MSE))
## [1] "MSE of rpart model trained with only AQI for test data set 247.40355272891"
Outcomes for the objectives and analysis performed.
##1. Benford’s law Conformity Check: Benford’s law conformity check function was implemented with basic functions. Comparing the actual probability of each leading digits with Benford’s probability. Due to incompleteness of the dataset available, most of the data set didnt confirm to Benfordness. station_day dataset had few of the pollutants data with close or acceptable confirmity.
##2. AQI of Bengaluru, Chennai, Delhi and Mumbai for the period 2015-2020: With city_day dataset, averaged over months and plotted AQI for over different months over the years. Using graph visualization, drawn inferences.
##3. Analysis of different pollutants for major cities over different years: With city_day dataset, plotted pollutants for different years. With graphs, derived the inferences.
##4. Analysis of different pollutants across different Indian seasons over the day during different hours: With station_day cleaned and benfordness checked dataset, plotted graphs for chennai and Bengaluru over different hours for the day and derived inferences for pollutants with different graphs
##5. Understanding the Impact of different pollutants, temperature, season, precipitation on PM2.5: Created a combined data set with cleaned station_day, wether and geodata plotted different graphs to study the relation or impact other variables over PM2.5.Able to derive clear inferences.
##6. Building different prediction model for PM2.5 using key pollutants, weather info and season for major cities using lm and rpart modeling: Built 22 different models using lm and rpart with different set explanatory variables set (1. NO, 2. NO + Season,3.NO2 + Season, 4.NOx + Season, 5.NO+ NO2 + NOx + Season, 6.NO+ NO2 + NOx + Season + tavg, 7.NO+ NO2 + NOx + Season + tavg + prcp,8. NO+ NO2 + NOx + Season + Elevation, 9.NO+ NO2 + NOx + Season + tavg + prcp + Elevation, 10.NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI, 11.AQI). The model is evaluated using both training data set and test set. The test data was prepared by combining cleaned city_day, wethear and geo info of major cities. MSE error for each of the model prediction calculated and stored for comparison. fmodel and prp graphs were plotted for study.
##7. Evaluation and comparison of performance of different models and selecting a optimal one: Comparing all the model results and MSE of each of the 22 model trained using LM and RPART, the model trained using explanatory variables NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQ yield better results. The below graph show actual vs prediction for both lm model and rpart model. lm model had MSE of 188.852974998694 for test data set and 106.871067357215 for training dataset.
#lmmodel10_output
ggplot(lmmodel10_output, aes(x=model_output, y= PM2.5)) +
geom_point(na.rm=TRUE) +
geom_abline(intercept=0, slope=1) +
labs(x='Predicted Values', y='Actual Values', title='LMPredictedvsActual\nNO+NO2+NOx+Tempavg+Precipitation+Elevation+Season+AQI')
ggplot(rpartmodel10_output, aes(x=model_output, y= PM2.5)) +
geom_point(na.rm=TRUE) +
geom_abline(intercept=0, slope=1) +
labs(x='Predicted Values', y='Actual Values', title='RPARTPredictedvsActual\nNO+NO2+NOx+Tempavg+Precipitation+Elevation+Season+AQI')
Results and inferences
##1. Benford’s law Conformity Check: PM2.5, AQI and few other pollutants data from city_day and station_day was checked for Benfordness.PM2.5 from station_day dataset had acceptable conformity and NO from station_day had close conformity. Rest of the data failed the Benfordness check as there were lot of missing info. This could have impacted the modeling. Hence used station_day dataset for training as it was better than city_day data. Used city_day dataset as test set.
##2. AQI of Bengaluru, Chennai, Delhi and Mumbai for the period 2015-2020: AQI over different months for major cities for the years(2015-2020) were highly varying. Delhi had very high AQI compared to other cities. Chennai and Bangalore almost similar AQI. The AQI is high during start of the year and really high during end of the year. Across different cities the common observation over the given years is that the AQI has improved for the respective city AQI benchmarks compare to past. The AQI is been varying across different months. This clearly indicated that seasonal changes and weather has impact on the AQI.
##3. Analysis of different pollutants for major cities over different years: Delhi has high AQI and higher concentration of pollutants for the given period. Bangalore has low concentration of PM2.5 and SO2. Mumbai has higher concentraion of NO, NO2 and NOx than Chennai and Bangalore. Chennai PM2.5 and other pollutant concentraion is neither high or less compared to other cities. CO concentration has reduced in all cities over the years. Mumbai has lowest O3 concentration. PM2.5 data Mumbai doesnt appear to be reliable as is unusually low.
##4. Analysis of different pollutants across different Indian seasons over the day during different hours: Chennai has low PM2.5 concentration during summer whereas Bengaluru has low concentration during monsoon. PM2.5 is high during morning hours till afternoon. Chennai has high O3 concentration during monsoon and summer whereas Bengaluru has during Spring and prewinter. O3 is high through the day for chennai but for Bengaluru peaks durign mid night and afternoon. For both Chennai and Bengaluru, CO is very high during monsoon and high during winter and it peaks first half of the day and drops during second of the day.NO is high during early morning and low during afternoons for Chennai and Bengaluru it peaks during early morning and late night. Bengaluru has high NO during winter and prewinter.SO2 varies through out the day, high during winter and low during summer for Chennai and for both Chennai and Bengaluru.SO2 varies through out the day for Chennai and Bengaluru. AQI is high during monsoon and low during summer and peaks during afternoon till midnight for Chennai. For Bengaluru, AQI drops during early morning and peaks in the afternoon, high during spring and low during monsoon.
##5. Understanding the Impact of different pollutants, temperature, season, precipitation on PM2.5: PM2.5 is higher in the order (Highest, Higher, High) seasons of Autumn, Prewinter and Winter. During Monsoons, PM2.5 is generally lowest with low NO,AQI, NOx, and NO2. Higher temperature, precipitation and elevation has associated low PM2.5. ##6. Building different prediction model for PM2.5 using key pollutants, weather info and season for major cities using lm and rpart modeling: Built 22 different models using lm and rpart with different set explanatory variables set (1. NO, 2. NO + Season,3.NO2 + Season, 4.NOx + Season, 5.NO+ NO2 + NOx + Season, 6.NO+ NO2 + NOx + Season + tavg, 7.NO+ NO2 + NOx + Season + tavg + prcp,8. NO+ NO2 + NOx + Season + Elevation, 9.NO+ NO2 + NOx + Season + tavg + prcp + Elevation, 10.NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQI, 11.AQI). The model is evaluated using both training data set and test set. The test data was prepared by combining cleaned city_day, wethear and geo info of major cities. MSE error for each of the model prediction calculated and stored for comparison. lm models were performing better than the rpart model.
##7. Evaluation and comparison of performance of different models and selecting a optimal one: Comparing all the model results and MSE of each of the 22 model trained using LM and RPART, the model trained using explanatory variables NO+ NO2 + NOx + Season + tavg + prcp + Elevation + AQ yield better results. lm model had MSE of 188.852974998694 for test data set and 106.871067357215 for training dataset.
##8.Future Work: The following objectives were not taken due to lack of suitable dataset and these can be considered for future work 1. Not all major pollutants are correlated with PM2.5, further analysis can be performed to build further optimal and cities related parameters like lat, lan, populatio, mobility, solid waste etc. 2. This can analysis can be further extended to see if there could be impact of green cover or tree census growth over the years on PM2.5 as trees play an important role in purifying air. 3. Study can be done to predict cardio vascular incidence in a give city with give pollutants details.